Maximizing Data Efficiency: Benefits of Global Deduplication


Advanced Storage Products Group

Table of Contents
  Understanding Deduplication
  Scalability Limitations
  Scope Limitations
  Islands of Capacity
  NEC's HYDRAstor
  Unified Archive and Backup
  Summary
  Sidebar Discussion: Determining Dedupe Ratios

Understanding Deduplication: The Complete Picture

Data deduplication as a concept was relatively unknown at the beginning of 2007, yet ended the year as a $100 million market in its own right. Along this trajectory there has been a lack of education and a great deal of hype surrounding the technology. First-generation deduplicating VTL and appliance vendors focused mainly on the dedupe ratio, claiming ever larger ratios in an effort to show technological superiority. The danger of this one-note approach is that dedupe ratios alone do not paint a complete picture of efficient, much less safe, deduplication.

The hidden downside of first-generation deduplication products is that, due to architectural deficiencies and capability limitations, they tend to create silos of deduped data and perpetuate disparate islands of management and data stores. These architectural limitations reduce deduplication efficiency while increasing complexity and management costs. Companies looking to upgrade tape-based backup and archive environments need solutions that provide global data deduplication while simplifying management and decreasing complexity. Steadily growing data stores alongside flat IT headcounts and budgets require companies to look beyond ratios and select architectures with the scalability and scope to resolve these problems permanently.

Powered by Intel Xeon Processors
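Throughout this discussion, a dedupe ratio is simply logical (pre-dedupe) bytes ingested divided by physical (post-dedupe) bytes stored. A minimal sketch, using hypothetical numbers for a mostly-static data set backed up weekly:

```python
def dedupe_ratio(logical_bytes: int, physical_bytes: int) -> float:
    """Ratio of data written by applications to data actually stored."""
    return logical_bytes / physical_bytes

# Hypothetical example: 30 weekly full backups of a 1 TB data set land
# as one unique baseline copy plus ~50 GB of changed data per backup.
logical = 30 * 1_000           # GB ingested across all 30 backups
physical = 1_000 + 30 * 50     # GB stored: baseline + per-backup changes
print(f"{dedupe_ratio(logical, physical):.1f}:1")  # 12.0:1
```

The same arithmetic explains why retention and backup type drive the ratio: every additional full backup adds mostly-duplicate logical data while adding little physical data.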

Scalability Limitations Create Deduped Data Silos

The ongoing growth of new application data, coupled with longer retention requirements, amplifies the scalability limitations of first-generation deduplication products, leading IT organizations to deploy multiple instances of these products to work around performance and capacity limits and keep pace with growing backup and archive needs. The result is silos of deduped data and islands of underutilized capacity, which reduce the efficiency of deduplication.

Lack of performance scalability is typically the first limitation hit. Merely meeting current backup windows and recovery times may require more concurrent backup streams and throughput than a single deduplicating VTL or appliance can deliver, leading IT to deploy a second, third, or more systems to meet current business requirements. The problems compound as more deduplication systems are needed to keep pace with year-over-year data growth, shrinking backup windows, and more stringent retrieval requirements.

With regard to capacity, first-generation deduplicating products eventually run out of disk, requiring the deployment of yet another instance of the product. Because first-generation dedupe products cannot extend their index of deduplicated data across multiple product instances, deduplication occurs separately within each instance of the VTL or appliance. Each additional instance must start its own deduplication process anew, even if it stores application data previously archived or backed up on one of the earlier instances. As shown in Figure 1, the inability to dedupe across multiple instances results in a proliferation of redundant data and creates silos of deduped data that increase capacity requirements and costs.
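The per-instance index problem can be illustrated with a toy content-addressed store. This is a deliberately simplified sketch, not any vendor's implementation: real products use variable-size chunking and far larger chunks than the 4-byte chunks used here.

```python
import hashlib

def chunks(data: bytes, size: int = 4):
    """Split data into fixed-size chunks (toy stand-in for real chunking)."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

class DedupeStore:
    """Toy content-addressed store: one dedupe index per instance."""
    def __init__(self):
        self.index = {}  # chunk hash -> chunk bytes

    def ingest(self, data: bytes):
        for c in chunks(data):
            # Store the chunk only if its hash is not already indexed.
            self.index.setdefault(hashlib.sha256(c).hexdigest(), c)

    def physical_bytes(self) -> int:
        return sum(len(c) for c in self.index.values())

backup = b"ABCDABCDEFGH"  # 12 logical bytes, 8 unique bytes

# Two isolated appliances each ingest the same backup: each index is
# blind to the other, so the unique chunks are stored twice.
a, b = DedupeStore(), DedupeStore()
a.ingest(backup); b.ingest(backup)
print(a.physical_bytes() + b.physical_bytes())  # 16

# One shared (global) index deduplicates across both ingests.
g = DedupeStore()
g.ingest(backup); g.ingest(backup)
print(g.physical_bytes())  # 8
```

The silo effect is exactly this duplication, multiplied across every appliance that holds overlapping data.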
Figure 1: Duplicate data proliferates due to scalability limitations and the lack of global deduplication across multiple first-generation appliances.

When factoring in both the performance and capacity scalability limitations of appliance-based solutions, new instances of the appliance must be deployed even if only one of those constraints is encountered (see Figure 2). This results in poor utilization of both performance and capacity resources, because available resources cannot be leveraged across appliances.

Figure 2: Lack of independent linear scalability of performance and capacity within an appliance, or of virtualization of those resources across appliances, leads to the deployment of additional appliances.

Many vendors attempt to hide true product capacity limitations by touting scalability numbers beyond a single system's true capability and claiming that their GUI manages multiple instances. While this approach may reduce the management burden, it does not eliminate the silos of deduplicated data and islands of underutilized capacity caused by deploying multiple dedupe systems. Nor does it eliminate the overhead of managing separate, segregated systems independently in terms of capacity and throughput provisioning to accommodate growth.

With multiple instances of deduplicating products, most companies can expect a loss of at least 20%, and as much as 50% or more, in deduplication efficiency, depending on the number of appliances deployed and the similarity of the data stored across them. As illustrated in Figure 3, this loss in dedupe efficiency results in a much lower overall dedupe ratio over time, translating into a 25-100% increase in disk capacity purchases.
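The 25-100% range follows directly from the arithmetic: physical capacity scales inversely with the effective dedupe ratio, so a 20% ratio loss requires 25% more disk and a 50% loss requires 100% more. A quick check (note the baseline ratio cancels out, so the result holds for any starting ratio):

```python
def extra_capacity(base_ratio: float, efficiency_loss: float) -> float:
    """Fractional increase in physical disk needed when the effective
    dedupe ratio drops by efficiency_loss (e.g. 0.20 for a 20% loss)."""
    degraded_ratio = base_ratio * (1 - efficiency_loss)
    # Physical capacity is inversely proportional to the dedupe ratio.
    return base_ratio / degraded_ratio - 1

for loss in (0.20, 0.50):
    print(f"{loss:.0%} efficiency loss -> {extra_capacity(10, loss):.0%} more disk")
# 20% efficiency loss -> 25% more disk
# 50% efficiency loss -> 100% more disk
```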

Sidebar: Determining Dedupe Ratios

There is no exact formula companies can universally apply to predict the dedupe ratio they will achieve. However, there are environmental characteristics that can be identified prior to implementing dedupe technology to forecast the relative level of dedupe ratio that might be reached.

The type of data stored: Different kinds of files have very different dedupe characteristics.

Spreadsheets, general documents, and file system data: There tends to be a fair amount of duplication, with incremental changes made to base documents that remain mostly the same from iteration to iteration. Email can also contain a lot of redundant data.

Image formats: The underlying data in new files may have little or nothing in common with existing image files, so dedupe may do little to reduce disk capacity. Many image formats are also already compressed, so even a dedupe technology with additional compression may save little capacity.

Net-net, if your data stream contains more general documents and email than images, you can expect higher deduplication ratios.

Figure 3: Multiple disparate appliances lead to efficiency loss in overall dedupe ratio, resulting in up to a 25-100% increase in disk capacity purchases.

Scope Limitations Create Deduped Data Silos

The other key to realizing the full benefits of deduplication is to extend its scope by combining archive and/or backup data stores or other applications so that these collective data stores are globally deduplicated. Even though nothing requires companies to store backup and archive data on separate systems with different architectures, this common approach further segregates deduplicated data by application or scope and, like the scalability limitation, reduces the effectiveness and value of data deduplication.
Deduplication inefficiency due to limited application scope is especially pronounced for archive and backup data, since data that is archived has almost always been backed up multiple times beforehand (see Figure 4). Most companies hold a great deal of redundant data between their archive and backup data stores that cannot be globally deduplicated across the disparate instances and architectures of deduplicating VTLs and appliances.

Figure 4: Separate backup and archive applications with separate deduplicating appliances result in duplicate data.

Sidebar: Determining Dedupe Ratios (cont'd)

Multiple copies of the same file: If a company knows or suspects there are multiple copies of the same file scattered throughout the organization, the benefits of deduplication could be significant. However, realizing the full benefit requires global deduplication; otherwise multiple copies of the deduplicated file will still exist.

Backing up production copies of data used in testing and development: The benefits of deduplication are potentially huge here, since test and development data closely resembles production data. However, companies can only obtain the benefit of deduplicating across these different data stores if they all back up and archive to the same deduplicating system.

Archiving less frequently used data: The benefits of dedupe are potentially large if the company is already backing up data to the same dedupe system, enabling global deduplication.

Islands of Capacity Increase Complexity and Costs

As the number of dedupe systems increases, so do complexity and cost. Scalability and scope limitations in first-generation dedupe products proliferate islands of capacity and data management headaches across the data center. Each system deployed has its own set of management needs: administrators are forced to access each system independently to provision it, set policies, and monitor and manage performance and capacity. Cost and complexity increase further because administrators often need to manually load balance, tune, and optimize the performance and capacity utilization of archive and backup data across multiple deduplicating systems. With multiple instances of deduplicating VTLs and appliances for each application, as well as platforms with different architectures for backup and archive, the complexity is multiplied (see Figure 5).
As a result, implementing and managing these multiple islands of deduplication negates many of the benefits of deduplication and results in a higher total cost of ownership (TCO). Manual load balancing of backup streams and schedules can yield small improvements in utilization, but it is time consuming, error prone, must be performed continuously, and will likely introduce inefficiencies into the backup and archive environment.

Figure 5: Limited scope and segregated infrastructure lead to inefficiencies and management complexity without global deduplication and virtualization across disparate systems.

Sidebar: Determining Dedupe Ratios (cont'd)

Type of backups run: Full backups lead to higher deduplication ratios relatively quickly. Companies performing incremental or differential backups will see lower deduplication ratios, as those backups already eliminate some duplication by copying only changed files.

How long backup or archive data is retained: The longer companies retain data, the better the chance that the deduplication appliance can match new data with existing data, increasing the deduplication ratio. Deduplication appliances that cannot scale performance or capacity will not be able to achieve this matching benefit.

Local versus global deduplication index: The deduplication index is one of the key factors in achieving higher deduplication ratios. First-generation dedupe products index within a single system, not across two or more systems, resulting in proliferation of redundant data and silos of deduped data.

The following chart (Figure 6) highlights the impact of islands of capacity on capacity utilization, showing increased inefficiency as IT deploys more disparate dedupe systems. Unused capacity headroom can be reduced if IT can scale capacity effectively within a single grid infrastructure. This improvement in capacity utilization can significantly reduce overall costs versus disparate appliances as total capacity needs grow over time.

Figure 6: Segregated appliances lead to capacity utilization inefficiency, as each system must manage its own capacity and throughput overhead and cannot leverage the available resources of other appliances.

NEC HYDRAstor Delivers Global Deduplication

HYDRAstor's grid architecture addresses both the scope and scalability issues that companies face with first-generation deduplicating VTLs and appliances by creating a unified disk storage platform optimized for both archive and backup data.
With HYDRAstor, companies can deploy one system that provides true global deduplication across the entire data center for multiple applications, minimizing overall capacity needs, simplifying management, and providing the lowest TCO among deduplicating disk-based solutions. Global deduplication maximizes dedupe ratios and capacity optimization, with an estimated 20% to 50% greater efficiency than isolated local deduplication across multiple appliances.

Scalability: Eliminating Dedupe Silos

HYDRAstor creates a single logical pool of storage using a grid storage architecture that supports massive scale-out of performance and capacity. It can start small, yet scale performance, capacity, or both, linearly and independently. Today, HYDRAstor scales from 1.8 TB/hr to 90 TB/hr of throughput and from 10 TB to more than 20 PB of capacity as a single, sharable system, while layering into existing backup networks non-disruptively (see Figure 7).

NEC HYDRAstor decouples performance nodes (Accelerator Nodes) from capacity nodes (Storage Nodes). DynamicStor, the intelligent management software of HYDRAstor, enables independent linear scalability of both performance (through Accelerator Nodes) and capacity (through Storage Nodes), so IT can meet current business needs regardless of current size or growth.

Figure 7: HYDRAstor deployment in a typical backup and archive scenario, showing backup, archive, and content depot applications writing over NFS/CIFS into one logical pool (1.8-90 TB/hr throughput, 10 TB-20 PB capacity) with DataRedux deduplication, Distributed Resilient Data (DRD), HYDRAlock, and WAN-optimized RepliGrid asynchronous replication over secured, encrypted transmission to a disaster recovery site.

HYDRAstor is designed as a distributed two-tier grid architecture made up of Accelerator Nodes (ANs) and Storage Nodes (SNs) built from industry-standard servers. Archiving or backup applications connect to the ANs, which are optimized to process NFS or CIFS I/O requests. The ANs distribute data across the SNs to maximize availability, throughput, and management efficiency across a common pool of storage. HYDRAstor provides any-to-any connectivity across all nodes, regardless of type, enabling ANs to store or retrieve data from any SN regardless of the size of the HYDRAstor system.

The SNs behave as one large, shareable pool of capacity that is fully self-managing and self-tuning. They automatically and non-disruptively balance capacity and performance across all of the nodes, and rebalance the system when nodes (capacity and inline processing power) are added or removed. Data is automatically routed to the node responsible for processing that portion of the distributed hash table, so deduplication is performed globally across all data, regardless of which AN or application the data came from (see Figure 8). HYDRAstor's patent-pending deduplication technology, DataRedux, deduplicates across all data stored on the system to provide true global deduplication.

Figure 8: HYDRAstor's grid architecture and distributed hash table ensure data is automatically routed to the responsible node and processed inline, while maintaining linear scalability as capacity grows.
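The routing idea behind a distributed hash table can be sketched as a hash-partitioned index. This is a toy model under assumed names (the node list and modulo partitioning are illustrative, not NEC's proprietary DataRedux implementation):

```python
import hashlib

NODES = ["SN1", "SN2", "SN3", "SN4"]  # hypothetical storage nodes

def responsible_node(chunk: bytes) -> str:
    """Route a chunk to the node that owns its slice of the hash space."""
    digest = hashlib.sha256(chunk).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# The same chunk always hashes to the same node, no matter which
# Accelerator Node or application submitted it -- so the duplicate
# check is inherently global without consulting every node's index.
assert responsible_node(b"chunk-42") == responsible_node(b"chunk-42")
```

One design note: simple modulo partitioning reshuffles most chunks when the node count changes; production grids typically use consistent hashing or similar schemes so that adding a node moves only a small fraction of the hash space.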

Scope: Unified Archive and Backup on a Single Platform

NEC HYDRAstor is optimized to store and manage both archive and backup data and can deduplicate data across multiple applications to maximize deduplication's effectiveness, resulting in higher overall deduplication ratios. By its very nature, backup data is highly redundant: companies back up the same data multiple times and retain those copies for 30 to 90 days on average. Since backups are typically taken before data is archived, redundancy also exists between backup and archive copies.

Companies frequently implement archiving to improve application performance and to free up capacity on tier-1 storage systems by moving infrequently accessed data to lower tiers of storage. For example, most enterprise organizations now archive old emails on lower-tier storage systems to satisfy legal requirements for data retention and to keep storage costs for production email systems under control. The downside is that the same email and its attachment(s) end up stored multiple times, in both the archive and backup repositories. Using HYDRAstor as a single, unified platform for both archive and backup, the archived data is 100% deduplicated against the backup data. HYDRAstor also eliminates the need to introduce and manage additional tiers of storage to host archived data.

The savings companies can expect through this type of consolidation are substantial. Aside from the storage capacity savings from recouping and reusing tier-1 disk space, companies do not need to dedicate as many staff resources to manage the archive and backup data stored on HYDRAstor as they would for other solutions.

NEC HYDRAstor: Simplified Management, Lower Costs

When it comes to storage management, one is better than many.
NEC HYDRAstor's scalability and flexibility reduce storage management complexity and costs in the following ways:

Reduces management complexity by cutting the number of storage devices companies need to manage to one. HYDRAstor's DynamicStor software additionally eliminates onerous management tasks such as provisioning, load balancing, and even data migration between nodes, further simplifying management.

Lowers capacity costs by creating one large, logical, and efficient pool of storage with a minimal amount of storage overhead. Companies can seamlessly scale HYDRAstor to whatever performance or capacity the combined archive and backup repository requires, eliminating islands of underutilized capacity and silos of deduped data.

Summary

Companies are now at a juncture where they have a unique opportunity to put in place an underlying backup and archive disk infrastructure that eliminates today's silos of deduped data and islands of underutilized capacity resulting from scalability and scope limitations. These deficiencies, common to first-generation dedupe VTLs and appliances, increase the costs of managing and keeping pace with backup and archive requirements. Silos of deduped data and islands of underutilized capacity can increase disk requirements by 100% or more, and the proliferation of multiple deduplicating systems multiplies onerous storage management tasks, raising management costs and overall environment complexity.

NEC's HYDRAstor uses an innovative grid architecture that provides the scalability and scope companies need to support growing backup and archive workloads on a single system, fully optimizing deduplication globally for all data across all nodes. Its grid architecture represents a new breed of simple-to-manage, simple-to-grow grid storage that offsets the complexity prevalent today. HYDRAstor is one system companies can deploy now to meet their archiving and backup needs for the long term.

About NEC Dynamic IT Infrastructure

NEC's HYDRAstor is part of NEC's Dynamic IT Infrastructure, which includes servers, storage, virtual desktop solutions, and system software that are smart, flexible, adaptive to change, scalable, resilient, and continuously evolving. Along with NEC's broad range of services, the NEC Dynamic IT Infrastructure provides an ideal platform for virtualization, consolidation, and business continuity, and is well suited to driving greater value and efficiency in solutions for physical security, law enforcement, emergency response, travel and entertainment, education, high-performance computing, and business.
This type of infrastructure allows IT organizations to move forward confidently and meet changing and growing business needs efficiently.

About NEC Corporation of America

NEC Corporation of America is a leading technology provider of network, IT, and identity management solutions. Headquartered in Irving, Texas, NEC Corporation of America is the North American subsidiary of NEC Corporation. It delivers technology and professional services ranging from server and storage solutions, IP voice and data solutions, and optical network and microwave radio communications to biometric security and virtualization, serving carrier, SMB, and large enterprise clients across multiple vertical industries. For more information, please visit http://www.necam.com/dynamicit/.

NEC CORPORATION OF AMERICA
2880 Scott Boulevard, Santa Clara, CA 95050
1 866 632-3226 / 1 408 844-1299
sales@necam.com
www.necam.com/hydrastor

2009, NEC Corporation of America. HYDRAstor, DynamicStor, DataRedux, Distributed Resilient Data (DRD), RepliGrid, and HYDRAlock are trademarks of NEC Corporation; NEC is a registered trademark of NEC Corporation. Intel, the Intel logos, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation in the U.S. and other countries. All other trademarks and registered trademarks are the property of their respective owners. All rights reserved. All specifications subject to change. wp111-3_0909