April 2010 Rosen Shingle Creek Resort Orlando, Florida



Data Reduction and File Systems. Jeffrey Tofano, Chief Technical Officer, Quantum Corporation

Today's Agenda: File Systems and Data Reduction Overview; File System and Data Reduction Integration Issues; Reviewing Data Reduction Technologies; Reduction and Data Management; Data Reduction Technologies On-the-Wire; Summary and Questions

About Quantum. The global leader in backup, recovery and archive: pioneer in disk backup, first with VTL for open systems, first with an integrated D2D2T system, and holder of a pioneering patent in variable-length deduplication. Dedupe solutions trusted throughout the world, with over 800 PB protected by Quantum deduplication technology.

Overview. Data storage requirements are growing wildly; the fastest growth is in unstructured data, and file/object-based storage dominates. Data reduction is becoming a required feature to deal with the growing volume of data. Many file system and reduction techniques are available, with lots of options and tradeoffs for integration. What makes sense going forward? There are many ways to integrate reduction techniques into file systems: basic integration (on-disk reduction) and data management integration (on-the-move reduction).

File Systems and Reduction. Key issues to consider when integrating reduction technology and file systems: What reduction techniques are available (compression, single instancing, dedupe)? How do the selected methods get integrated: layered on top of the file system, integrated into the file system, or a new file system built around the selected reduction method(s)? What are the tradeoffs of each reduction method with respect to the selected integration model? How does the selected model affect data management?

Data Reduction Technologies. What reduction technologies are available to integrate? Compression technologies tend to operate at file or object scope, and on relatively small chunks; they typically give good intrinsic reduction, but little or no cross-object benefit. Single instance technologies tend to operate across files/objects/blocks and on relatively large chunks; they give poor intrinsic reduction (they are best at removing copies) and are very sensitive to small changes. Dedupe technologies also operate across files/objects/blocks, but at a lower level of granularity than SI; they can generate good intrinsic reduction, handle small changes well, and their reduction rate often scales with the size of the repo (to a point). Hybrids and mixed technologies tend to be mixes of the above schemes; they are the path to the future.

Compression: An Overview. Lots of different compression schemes exist. At a high level, most share a common processing model: build a dictionary (static/dynamic/probabilistic), encode symbols that represent larger entities in the dictionary, and replace data extents in the stream with encoded symbols. Most schemes are local in scope: they reduce individual objects well but don't recognize copies of objects. Some are general, some are data-set specific; reduction rates vary from 2x to 8x. We are starting to see compression schemes that open the scope of reduction across objects.
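To make the "local in scope" point concrete, here is a minimal sketch (not from the presentation; zlib stands in for any dictionary coder) showing that per-object compression shrinks each object but does not recognize an identical copy:

```python
# Minimal sketch (assumption: zlib as a stand-in for any dictionary coder)
# showing that per-object compression is local in scope: two identical
# objects each shrink, but the copy is not recognized across objects.
import zlib

obj = b"customer record: " * 1000          # highly repetitive payload
copy = bytes(obj)                           # an exact duplicate object

stored = [zlib.compress(obj), zlib.compress(copy)]

print("raw bytes per object:  ", len(obj))
print("compressed per object: ", len(stored[0]))
print("total stored:          ", sum(len(s) for s in stored))
# Both copies are stored: reduction is in the 2x-8x range inside each
# object, but nothing removes the second, identical object.
```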

Single Instancing: An Overview. Again, many different solutions are in the market, but most operate at file/object scope. At a high level, operation is simple: a file or object is fingerprinted (checksum/hash/signature), the fingerprint is indexed, and when objects move in or out, the index is queried to see if a matching fingerprint exists; either the original object or a reference to the existing object is stored. Single instancing is typically global in scope (limited by the index). Reduction rates vary depending on the level of object-level copies. File and block solutions exist. Small changes often result in an entirely new object.
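A minimal single-instance sketch, assuming a hypothetical in-memory index, illustrating the fingerprint/index/reference flow and why a one-byte change yields an entirely new object:

```python
# Minimal single-instance store sketch (hypothetical in-memory index).
# Whole objects are fingerprinted; a matching fingerprint stores only a
# reference, but a one-byte change produces an entirely new object.
import hashlib

index = {}    # fingerprint -> unique object bytes
refs = []     # what the namespace actually records

def store(obj: bytes) -> None:
    fp = hashlib.sha256(obj).hexdigest()
    if fp not in index:
        index[fp] = obj          # unique: keep the data
    refs.append(fp)              # otherwise: keep only a reference

store(b"A" * 4096)
store(b"A" * 4096)               # exact copy -> single-instanced
store(b"A" * 4095 + b"B")        # tiny change -> stored in full again
print("objects referenced:", len(refs), " unique objects stored:", len(index))
```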

Dedupe: An Overview. Although several different solutions are available, there are three major variants: hash-based variable schemes, similarity/byte-differential schemes, and fixed-block schemes (essentially the same as block SI). At a high level, operation is a bit more complex, especially for variable schemes: some criterion (e.g., a simple rolling hash) is used to determine data boundaries and form chunks; chunks are fingerprinted (complex hash/checksum); as chunks move in and out, the index is queried to see if a matching fingerprint exists; either the unique chunk is stored or a reference to the existing chunk; and map data must be created to rehydrate the data properly.
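The following toy sketch (illustrative only, not any vendor's algorithm) walks through those variable-scheme steps: a simple rolling value cuts content-defined chunk boundaries, chunks are fingerprinted, only unique chunks are stored, and a recipe of fingerprints rehydrates the data:

```python
# Toy variable-length dedupe sketch: a simple boundary rule (low bits of a
# rolling value) cuts content-defined chunks, chunks are fingerprinted,
# only unique chunks are stored, and a per-object recipe of fingerprints
# is kept so the data can be rehydrated.
import hashlib
import random

chunk_store = {}                          # fingerprint -> unique chunk bytes

def split_chunks(data, mask=0x3FF):       # roughly 1 KiB average chunk
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF     # toy rolling value
        if (h & mask) == mask or i == len(data) - 1:
            yield data[start:i + 1]
            start = i + 1

def store(data):
    recipe = []                           # map needed to rehydrate later
    for chunk in split_chunks(data):
        fp = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(fp, chunk) # duplicates become references
        recipe.append(fp)
    return recipe

def rehydrate(recipe):
    return b"".join(chunk_store[fp] for fp in recipe)

random.seed(0)
block = bytes(random.randrange(256) for _ in range(64 * 1024))
stream = block + b"one small inserted edit" + block   # mostly repeated content

recipe = store(stream)
assert rehydrate(recipe) == stream
print("chunks referenced:   ", len(recipe))
print("unique chunks stored:", len(chunk_store))
print("bytes stored:", sum(len(c) for c in chunk_store.values()), "of", len(stream))
```

Because the boundaries are content-defined, the chunking resynchronizes after the inserted edit and most of the repeated region dedupes; a fixed-block scheme would not recover this way.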

Dedupe: Additional Details. As simple as the dedupe process may seem, there are lots of variations. Where does dedupe get done: client-side, target-side or cooperative? In the application, at the file level or at the block level? When does dedupe get done: inline, post-process or in an adaptive/hybrid fashion? How smart is the dedupe method: is it application or data-format aware, or does one parsing model fit all?

Data Reduction: Rules for All of Us. The bigger the data set, the more costly it is to process (I/O, CPU, memory bandwidth). Index-based technologies are difficult to scale: lots of little chunks (good for reduction) means lots to search, while fewer big chunks (less reduction) means less to search; the index has to be stored somewhere, and references can introduce fragmentation. Consider the math behind a multi-terabyte store: global scope means many objects with potentially many chunks. Many reduction methods only work on sequential data; workloads matter... a lot. There is no free lunch: somehow, somewhere you have to dehydrate and rehydrate the data!
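A back-of-the-envelope calculation of that index math, using assumed numbers (repo size and bytes per index entry are illustrative, not from the presentation):

```python
# Back-of-the-envelope index math for a multi-terabyte store (illustrative
# numbers only): smaller chunks mean better reduction but a much larger
# index to build, search and store somewhere.
TIB = 2 ** 40
repo_bytes = 100 * TIB                 # assumed 100 TiB of unique data

for avg_chunk in (4 * 1024, 64 * 1024, 1024 * 1024):
    n_chunks = repo_bytes // avg_chunk
    # assumed ~64 bytes per index entry (fingerprint + location + metadata)
    index_bytes = n_chunks * 64
    print(f"{avg_chunk // 1024:5d} KiB chunks -> "
          f"{n_chunks / 1e9:6.2f} billion entries, "
          f"index ~ {index_bytes / TIB:.2f} TiB")
```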

Data Reduction: Consequences. Each reduction method makes tradeoffs to balance performance cost against reduction ratio: just because we can reduce wonderfully doesn't mean the customer is willing to pay the performance cost. In each class of reduction technology, individual solutions make additional and different tradeoffs; you can't mix and match most of them without bottlenecks, and since software is often bound to hardware, this can lead to a lot of deployment complexity. Each reduction method tends to have a sweet spot with respect to the data types it can reduce well. Solutions must increasingly deal with both low- and high-entropy data, and with pre-compressed or encrypted data. Customers don't want to manage pools of like data.

Integrating Reduction and File Systems. There are tradeoffs when integrating the different reduction methods. Compression schemes: the desired reduction rate can drastically affect performance and trigger excessive read-modify-write operations; CPU utilization varies dramatically, and offload options still increase access latencies. Single instancing schemes: often troublesome for hot data that changes a lot, and often require namespace enhancements (i.e., lookup by hash). Dedupe schemes: fixed schemes are often far easier than variable schemes, and indexing can become a central access bottleneck.

Data Management and Reduction. How does data reduction affect typical data management tasks? Typical data management features include snapshots, replication, and migration/ILM/HSM. How do the various data reduction schemes map onto those features: costs and benefits in on-disk footprint, on the wire, and when data is tiered and retrieved?

Snapshots and Data Reduction. Are snapshots and the described reduction schemes compatible and/or beneficial? Snapshots and dedupe are both reference-based technologies: should they be layered or designed together, and what are the benefits and downsides of each approach? Can deltas be reduced with value in both single-server and distributed environments?

Replication and Data Reduction. Since replication typically involves on-the-wire transfers, the data reduction benefits are obvious, but: not all reduction schemes are equivalent, and the benefits often depend on topology (simple DR setups, edge-to-core setups, distribution and migration setups). Distance can dramatically affect the benefits of each reduction scheme.

Data Reduction On-the-Wire. There are multiple considerations when moving data over the wire: Is data being moved between a data-reduced repo and a traditional raw system? Is data being moved between two systems with the same reduction technology? When using similar data reduction systems, is existing data being replicated or copied? Can multiple data reduction technologies be employed at each stage of movement? Mixing file- and block-level solutions is often problematic, and mixing NAS and VTL demonstrates similar problems. What media must the data be moved over: high-latency or low-latency? Each data reduction scheme has benefits and downsides in each of the above scenarios.

Compression On-the-Wire. Data compression is the most ubiquitous on-the-wire solution; many solutions are available, and often they don't need to be matched (smart compressors/decompressors). The benefits are obvious, but so are the costs. Less data moves (directly related to the reduction rate), which is good, but system resources are consumed on one or both sides depending on the need and the model. Two identical files being moved are each reduced, but there are still two files transferred; very limited (if any) copy protection is afforded. Although rare, some schemes require significant static dictionary communication before data can be shipped.
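A small sketch of the tradeoff described above, using zlib as a stand-in wire compressor (illustrative, not a specific product): fewer bytes cross the link, CPU is spent on both sides, and the identical copy is still transferred twice:

```python
# Sketch of on-the-wire compression (zlib as a stand-in): the sender spends
# CPU to shrink what goes over the link, the receiver spends CPU to expand
# it, and two identical files are still transferred twice.
import zlib

def send(payload: bytes) -> bytes:
    c = zlib.compressobj()
    return c.compress(payload) + c.flush()      # what actually crosses the wire

def receive(wire_bytes: bytes) -> bytes:
    return zlib.decompress(wire_bytes)

file_a = b"log line repeated many times\n" * 2000
file_b = bytes(file_a)                           # identical copy

on_wire = [send(file_a), send(file_b)]
assert receive(on_wire[0]) == file_a and receive(on_wire[1]) == file_b
print("raw bytes:", 2 * len(file_a), " wire bytes:", sum(map(len, on_wire)))
# Reduction tracks the compression ratio, but the copy is still sent twice.
```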

Single Instance On-the-Wire. Single instance technologies are also widely available for on-the-wire reduction. Solutions must often be matched: there are many variations in what gets fingerprinted and how, and both ends must match. The overall scheme is simple: the client obtains or calculates an object fingerprint, sends the fingerprint to the server, the server queries its object index and responds, and the client only sends the object if it is unique. The benefits and costs are also obvious. File-level SI can completely eliminate the transfer of a copy with one back-and-forth negotiation; block-level SI often goes through a series of fixed-size negotiations to accomplish the same thing. But things work best when fingerprints are known ahead of time (e.g., replication); when they are not, they must be calculated, which can introduce CPU load and costly file buffering.
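A sketch of that negotiation, assuming a hypothetical in-memory server object: one fingerprint round trip can eliminate the transfer of a whole file-level copy:

```python
# Sketch of the single-instance negotiation (hypothetical in-memory "server"):
# fingerprint, query the index, send the object only if it is unique.
import hashlib

class Server:
    def __init__(self):
        self.objects = {}                    # fingerprint -> object

    def has(self, fp: str) -> bool:          # server queries its object index
        return fp in self.objects

    def put(self, fp: str, obj: bytes):      # unique object is stored
        self.objects[fp] = obj

def client_send(server: Server, obj: bytes) -> int:
    fp = hashlib.sha256(obj).hexdigest()     # fingerprinting costs client CPU
    if server.has(fp):                       # one back-and-forth negotiation
        return len(fp)                       # copy eliminated: only the fingerprint moved
    server.put(fp, obj)
    return len(fp) + len(obj)                # unique data must still move

srv = Server()
doc = b"quarterly report " * 10000
print("first copy, wire bytes: ", client_send(srv, doc))
print("second copy, wire bytes:", client_send(srv, doc))
```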

Dedupe On-the-Wire. Most dedupe vendors offer dedupe-enabled replication, but there is a lot of variance. Most are somewhat complex forms of a simple model: the client batches up a group of sequential chunk fingerprints, sends the batch to a smart target that can query the existence of each fingerprint, the target sends back the results, and the client pushes the unique data. The above scheme only works when client and server can both form identical chunks and fingerprints. Collaborative dedupe schemes are less common; these schemes provide a method that allows the client to chunk and fingerprint data to enable the negotiation. Collaborative schemes don't work over the old legacy protocols (NAS); that's starting to change (OST/XAM/pNFS).
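A sketch of that batched negotiation model, with a hypothetical in-memory target and fixed-size chunks for brevity (real schemes need matching variable-length chunking and fingerprints on both ends):

```python
# Sketch of batched dedupe-enabled replication: the client batches chunk
# fingerprints, the target reports which are new, and only unique chunks
# cross the wire. Fixed-size chunks are used here purely for brevity.
import hashlib
import random

class Target:
    def __init__(self):
        self.chunks = {}                       # fingerprint -> chunk

    def query(self, fps):                      # which fingerprints are missing?
        return [fp for fp in fps if fp not in self.chunks]

    def push(self, fp, chunk):
        self.chunks[fp] = chunk

def replicate(target: Target, data: bytes, chunk_size=4096) -> int:
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    fps = [hashlib.sha256(c).hexdigest() for c in chunks]
    missing = set(target.query(fps))           # one batched round trip
    sent = 0
    for fp, chunk in zip(fps, chunks):
        if fp in missing:
            target.push(fp, chunk)             # only unique chunks move
            sent += len(chunk)
    return sent

random.seed(1)
base = bytes(random.randrange(256) for _ in range(256 * 1024))
backup_monday = base
backup_tuesday = base + b"today's small change"

tgt = Target()
print("Monday bytes sent: ", replicate(tgt, backup_monday))
print("Tuesday bytes sent:", replicate(tgt, backup_tuesday))
```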

Dedupe On-the-Wire. The benefits and costs are more subtle. Most dedupe solutions send a file/object-level hash of hashes to prune copies, similar to SI technologies; some provide a hierarchical hash-of-hashes to obviate the transfer of large ranges; and most can negotiate individual chunks. For solutions that negotiate all (or most) chunks, a large number of hash negotiations can result. Results can be excellent when much of the actual data transfer is obviated, but they can add to transfer overhead when dedupe ratios are low. The cost of hash negotiations serializes data transfers; this can be invisible on low-latency wires but cause significant slowdowns on high-latency wires.
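A sketch of the hash-of-hashes idea (illustrative only): a single range-level digest comparison can prune an entire matching range instead of negotiating every chunk:

```python
# Sketch of a hash of hashes: one digest over an ordered range of chunk
# fingerprints lets an already-matching range be pruned with a single
# comparison rather than one negotiation per chunk.
import hashlib

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

def range_digest(fps) -> str:
    # hash of hashes over an ordered range of chunk fingerprints
    return hashlib.sha256("".join(fps).encode()).hexdigest()

source_fps = [fingerprint(b"chunk-%d" % i) for i in range(1000)]
target_fps = list(source_fps)            # pretend the target already holds this range

if range_digest(source_fps) == range_digest(target_fps):
    negotiations = 1                     # whole range pruned in one exchange
else:
    negotiations = len(source_fps)       # fall back to per-chunk negotiation
print("negotiations needed:", negotiations, "instead of", len(source_fps))
```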

Data Reduction and ILM/HSM. Similar to replication, data reduction's benefits seem obvious, but: How do different reduction schemes affect movement from disk to disk and/or disk to tape? How do different schemes affect read-only copies and versions? How do different schemes affect or complement searches and lookups?

Options: How Does a Customer Choose? How do you know if a solution works for your type of data? Ask the vendor? Rough math? Try it? Data analysis tools? Sizing tools? Other customer references? Whatever you do, start to understand what performance level you need when pushing data to and from the repo, and what your data protection and replication needs are; do you need to implement on high-latency or low-latency networks (or both)?