Data Deduplication Methods for Achieving Data Efficiency


Data Deduplication Methods for Achieving Data Efficiency
Matthew Brisse, Quantum
Gideon Senderov, NEC...

SNIA Legal Notice
The material contained in this tutorial is copyrighted by the SNIA. Member companies and individuals may use this material in presentations and literature under the following conditions:
- Any slide or slides used must be reproduced without modification.
- The SNIA must be acknowledged as source of any material used in the body of any document containing material from these presentations.
This presentation is a project of the SNIA Education Committee. Neither the Author nor the Presenter is an attorney and nothing in this presentation is intended to be nor should be construed as legal advice or opinion. If you need legal advice or legal opinion please contact an attorney. The information presented herein represents the Author's personal opinion and current understanding of the issues involved. The Author, the Presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information.
NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.

Abstract
Data Deduplication Implementation Overview
This technical session will address the nuances of data deduplication and the various design approaches available today. Space savings technologies including data deduplication are used to dramatically improve storage efficiency. The session will address the question of what data deduplication is, how it is performed, and the architectural choices available today. It will address various design approaches including fixed length and variable length segmentation, inline and post processing, source and target, availability and resiliency of deduplicated data, and complementary use of removable media. It will also explore the factors affecting space reduction ratios relative to specific capacity optimization techniques.

Agenda
- Definitions
- Where and How Data Deduplication Works
- Review Fixed or Variable Segment Approaches
- Review of Different Design Approaches
- Removable Media Integration
- What to Expect with Data Deduplication Ratios
- Questions

Definitions
- Data Deduplication is the process of examining a data-set or I/O stream at the sub-file level and storing and/or sending only unique data. The definition of "what is a duplicate" is predicated upon the method used to evaluate, identify, track and avoid duplication. The deduplication process includes updating tracking information, storing and/or sending data that is new and unique, and disregarding any data that is a duplicate.
- Compression is the encoding of data to reduce its storage requirement. Deduplicated data can also be compressed.
- Single Instance Storage is the replacement of duplicate files with references to a shared copy.
Check out SNIA Tutorial: Green Storage TWG Status & Plans
Check out SNIA Tutorial: Introduction To Data Protection
Check out SNIA Tutorial: The risk of retention and deletion in the face of FRCP

Where Deduplication Can Happen
Multiple deployment examples are illustrated; specific deployments are selected based on customer situation.
[Diagram labels: Remote Office; WAN Optimization Appliances; WAN; Data Center; NAS, SAN, VTL, Server Appliance; Agents; Software Appliance; Integrated Replication; Storage System; Gateway Appliance]

Deduplication: How it Works
1. Evaluate Data
2. Identify Redundancy
3. Create or Update Reference Information
4. Store or Transmit Unique Data Once
5. Read or Reproduce Data
Details To Follow
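The five steps above can be sketched in a few lines. This is a minimal illustration, not the method of any particular product: the `DedupStore` class, its method names, the fixed 4-byte segment size used in the demo, and the use of SHA-256 fingerprints to identify redundancy are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy in-memory deduplicating store (hypothetical, for illustration)."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size
        self.segments = {}  # fingerprint -> unique segment bytes
        self.recipes = {}   # object name -> ordered list of fingerprints

    def write(self, name, data):
        """Store `data` under `name`; return the number of new bytes stored."""
        recipe = []
        new_bytes = 0
        # 1. Evaluate data: walk the stream one segment at a time.
        for offset in range(0, len(data), self.segment_size):
            segment = data[offset:offset + self.segment_size]
            # 2. Identify redundancy: assumed here to be a fingerprint lookup.
            fp = hashlib.sha256(segment).hexdigest()
            if fp not in self.segments:
                # 4. Store unique data once.
                self.segments[fp] = segment
                new_bytes += len(segment)
            # 3. Create or update reference information.
            recipe.append(fp)
        self.recipes[name] = recipe
        return new_bytes

    def read(self, name):
        # 5. Read or reproduce data by following the references in order.
        return b"".join(self.segments[fp] for fp in self.recipes[name])


store = DedupStore(segment_size=4)
print(store.write("dump1", b"AAAABBBBCCCC"))  # every segment is new
print(store.write("dump2", b"AAAABBBBDDDD"))  # only the last segment is new
print(store.read("dump2"))
```

The second write stores only the one segment that differs from the first dump; the two repeated segments become references.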

Deduplication Simplified
[Figure: Dump #1, Dump #2, Dump #3, Dump #4. Legend: new unique data; repeat data; pointer to unique data segment]

Reading the Data
[Diagram labels: Read Request; Read Fulfilled; Application/CIFS/NFS/VTL Interface; Deduplication Engine; Reconstitution/Verification; Metadata References; Deduplicated Data]
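As a rough sketch of the reconstitution and verification step on the read path: the function below rebuilds an object from an ordered list of metadata references and checks each segment before returning it. The `reconstitute` name, the dictionary used as the segment store, and the choice of SHA-256 digests as references are assumptions for illustration; the slide does not specify how verification is done.

```python
import hashlib


def reconstitute(references, segment_store):
    """Rebuild an object from its ordered references (hypothetical helper).

    `references` is a list of SHA-256 hex digests; `segment_store` maps each
    digest to the stored segment bytes. Every segment is re-hashed so that
    corruption is detected before the data reaches the application.
    """
    parts = []
    for fp in references:
        segment = segment_store[fp]
        if hashlib.sha256(segment).hexdigest() != fp:
            raise ValueError(f"segment {fp[:8]} failed verification")
        parts.append(segment)
    return b"".join(parts)


# Two unique segments serve a three-segment object.
unique = {hashlib.sha256(s).hexdigest(): s for s in (b"AAAA", b"BBBB")}
fp_a, fp_b = list(unique)
print(reconstitute([fp_a, fp_b, fp_a], unique))
```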

Fixed or Variable Segment Deduplication
- Fixed length segment deduplication
  - Evaluation of data includes a fixed reference window used to look at segments of data during the deduplication process
  - Provides fixed granularity, e.g. 4KB, or 8KB, or 128KB
- Variable length segment deduplication
  - Evaluation of data uses a variable length window to find duplicate data in the stream or volume of data processed
  - Provides variable granularity, e.g. average 4KB or 32KB
- Method chosen may affect deduplication results
- Effects observed will vary by method
- Segmentation may not apply to all deduplication
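One way to see why the method chosen may affect results is to insert a single byte at the front of a stream and compare how many segments each approach still recognizes. The sketch below is an assumption-laden illustration: the variable-length segmenter places a boundary wherever a CRC-32 of the trailing 16 bytes is divisible by 64, which is one simple form of content-defined chunking and not necessarily what any product uses. Function names, window size and divisor are hypothetical.

```python
import hashlib
import zlib


def fixed_segments(data, size=64):
    """Cut the stream at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_segments(data, window=16, divisor=64):
    """Cut the stream where the content of the trailing window says so.

    A boundary depends only on the last `window` bytes, not on the absolute
    offset, so boundaries move along with the data when bytes are inserted.
    """
    segments, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) % divisor == 0:
            segments.append(data[start:end])
            start = end
    if start < len(data):
        segments.append(data[start:])
    return segments


# Deterministic pseudo-random test data (4096 bytes), then a 1-byte insertion.
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
shifted = b"\x00" + original

for name, segmenter in (("fixed", fixed_segments), ("variable", variable_segments)):
    before = segmenter(original)
    after = set(segmenter(shifted))
    reused = sum(1 for seg in before if seg in after)
    print(f"{name}: {reused} of {len(before)} segments found again after the insertion")
```

With fixed offsets every segment after the insertion point is shifted by one byte and no longer matches. With content-defined boundaries, only the segment containing the insertion changes.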

Design Approach
- Gateway: Hardware capable of deduplicating data excluding storage hardware
- Appliance: Hardware capable of deduplicating data and/or storing deduplicated data including the storage hardware dedicated for this purpose
- Storage System: General purpose storage system that is capable of deduplicating data
- Grid Storage: Distributed system capable of deduplicating data across multiple independent components that can scale capacity and performance
- Agent: Client-side software capable of deduplicating data for a specific application
- Software Appliance: Server-side software capable of storing deduplicated data from application agents
- Storage Software: Software capable of deduplicating data and/or storing deduplicated data that is neither an agent nor a software appliance

Target Deduplication
- Stand-alone central repository for data
- Could be NAS, SAN, VTL, other
- Could be attached via Ethernet, FC, or both
- Deduplicates data as internal process either in-line, post-process, or both
- May include self-contained or gateway configurations
- May include other methods of data reduction in addition to deduplication, such as compression, or object level differencing
- Allows global deduplication across all clients
- Scale of deduplication space may vary by implementation
[Diagram labels: Data Mover (Backup Server); LAN; Deduplication VTL Appliance or Deduplication Storage Appliance]

Source Deduplication
- Client software or agent installed at deduplication point
- Example: Deduplication agent installed on client / server
- Agent scans for new or modified data
- Unchanged data simply referenced again without further processing
- Allows incremental forever approach while providing full backups
- Typically shortens backup windows relative to both fulls and incrementals
- Only unique data sent to central repository
- Reduction in data transmitted and stored
- Could allow global deduplication across all clients
- Scale of deduplication space may vary by implementation
- Ideal for branch office applications
- May not require additional hardware at remote sites
- Modest CPU impact
- May include other methods of data reduction in addition to deduplication, such as compression
[Diagram labels: Source Deduplication; Management Console; LAN; Storage Repository for deduplicated data; Deduplication Software Installed]
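The "only unique data sent to central repository" behaviour can be sketched as a small exchange between an agent and a repository. Everything named here is hypothetical: the `Repository` class, its `missing` and `store` methods, the `source_backup` function, and the fingerprint-first protocol are assumptions chosen to illustrate the idea, not a description of a real product's wire protocol.

```python
import hashlib


class Repository:
    """Toy central repository shared by all clients (hypothetical)."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes
        self.backups = {}   # backup name -> ordered list of fingerprints

    def missing(self, fingerprints):
        """Tell the agent which fingerprints the repository has not seen."""
        return {fp for fp in fingerprints if fp not in self.segments}

    def store(self, name, recipe, new_segments):
        self.segments.update(new_segments)
        self.backups[name] = recipe


def source_backup(name, data, repo, segment_size=4096):
    """Agent side: fingerprint locally, send only segments the repo lacks.

    Returns the number of segment bytes actually transmitted.
    """
    segments = [data[i:i + segment_size] for i in range(0, len(data), segment_size)]
    recipe = [hashlib.sha256(seg).hexdigest() for seg in segments]
    wanted = repo.missing(recipe)
    to_send = {fp: seg for fp, seg in zip(recipe, segments) if fp in wanted}
    repo.store(name, recipe, to_send)
    return sum(len(seg) for seg in to_send.values())


repo = Repository()
print(source_backup("client1-day1", b"A" * 8 + b"B" * 8, repo, segment_size=8))
print(source_backup("client1-day2", b"A" * 8 + b"C" * 8, repo, segment_size=8))
print(source_backup("client2-day1", b"C" * 8, repo, segment_size=8))
```

Each backup records a full recipe, so every backup can be restored as a full even though only changed segments cross the network. The third call shows the global case: a second client sends nothing because the repository already holds its segment.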

Deduplication for Near-line Use
[Diagram labels: Application Servers; LAN; ILM / HSM Data Classification Processor; SAN; Deduplication Near-line Storage; Primary SAN]

Deduplication for Backup and Recovery
[Diagram labels, Software Deduplication: Deduplication Software Installed; LAN; Central Backup Server/Media Server; Deduplicated Backup to Disk Target; Management Console; Storage Repository for Deduplicated Data; Removable Media]
[Diagram labels, Appliance Deduplication: Application Servers; LAN; Data Mover (Backup Server); Deduplication VTL Appliance; Removable Media]

Scope of Data Deduplication
- Data Deduplication for ALL data from ALL sources across the entire solution
- Single copy of data across entire solution
- Can span multiple solution components
- Can span multiple applications: Backup, Archive, Nearline
- Can span multiple datacenters
Global Deduplication: Reference to a repository of deduplicated data spanning multiple computing systems to identify duplicate data.
Local Deduplication: Reference to a repository of deduplicated data storing data originating from a single storage system.
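The difference between the two scopes can be shown with a small count of stored segments. This is a sketch under simple assumptions: each system's data is represented as a list of segment identifiers, and the `stored_segments` function and the sample systems are hypothetical.

```python
def stored_segments(systems, scope):
    """Count segments kept under a given deduplication scope.

    `systems` maps a system name to the list of segment identifiers it wrote.
    "local" gives each system its own repository; "global" shares one.
    """
    if scope == "local":
        return sum(len(set(segments)) for segments in systems.values())
    if scope == "global":
        return len(set().union(*systems.values()))
    raise ValueError(f"unknown scope: {scope}")


systems = {
    "backup": ["a", "b", "c"],
    "archive": ["b", "c", "d"],
    "nearline": ["a", "d"],
}
print(stored_segments(systems, "local"))   # duplicates across systems are kept
print(stored_segments(systems, "global"))  # one copy across the entire solution
```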

Removable Media Integration
- Deduplication integration with tape
- Create long term removable media storage for compliance and archive
- Different data path approaches
  - Path through backup server
  - Path direct from deduplication to removable media storage
- Two options for VTL
  - Application-specific removable media integration: backup software initiates and manages the entire process
  - Dedup device manages the removable media environment autonomously
[Diagram labels: Backup Server; Dedup; Removable Media]

What to Consider
Factors that will impact your results:
- Different applications or data types
- Bandwidth and latency
- Backup/Archive Policies
- Data Protection Overhead
- Compression and Encryption
- Global Deduplication and Scope
- Deduplicated Data Resiliency
- Scalability
  - Capacity
  - Performance
- Legal Considerations

Estimating the Deduplication Ratio
Your mileage may vary!
Factors associated with higher data deduplication ratios:
- Data created by users
- Low change rates
- Reference data and inactive data
- Applications with lower data transfer rates
- Use of full backups
- Longer retention of deduplicated data
- Wider scope of data deduplication
- Continuous business process improvement
- Smaller segment size
- Variable-length segment size
- Content-aware
- Temporal data deduplication
Factors associated with lower data deduplication ratios:
- Data captured from mother nature
- High change rates
- Active data, encrypted data, compressed data
- Applications with higher data transfer rates
- Use of incremental backups
- Shorter retention of deduplicated data
- Narrower scope of data deduplication
- Business as usual operational procedures
- Larger segment size
- Fixed-length segment size
- Content-agnostic
- Spatial data deduplication
Understanding data deduplication ratios

Deduplication of Backup Data
[Figure]

Deduplication of Non-Backup Data
[Figure]

The Space Reduction Ratio
[Diagram: Bytes In -> Capacity Optimization -> Bytes Out]

$$\text{Ratio} = \frac{\text{Bytes In}}{\text{Bytes Out}}$$

$$\text{Space Reduction \%} = 1 - \frac{1}{\text{Ratio}}$$

Example:
- Bytes In: (7 Days Full Backup) x (100GB per day) = 700GB
- Bytes Out: 52GB changed data actually stored
- Space Reduction Ratio = 700GB / 52GB = 13.5
- Space Reduction % = 1 - (1 / 13.5) = 92.6%
Understanding data deduplication ratios
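The two formulas translate directly into code. The short sketch below reproduces the slide's example; the function name `space_reduction` is a hypothetical choice, and the percentage is computed from the unrounded ratio, which still rounds to the slide's 92.6%.

```python
def space_reduction(bytes_in, bytes_out):
    """Return (ratio, space reduction percent) for a capacity optimization step."""
    if bytes_out <= 0:
        raise ValueError("bytes_out must be positive")
    ratio = bytes_in / bytes_out
    percent = (1 - 1 / ratio) * 100
    return ratio, percent


# Slide example: seven daily full backups of 100 GB, 52 GB actually stored.
ratio, percent = space_reduction(7 * 100, 52)
print(f"ratio {ratio:.1f}:1, space reduction {percent:.1f}%")
```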

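The Deduplication Ratio is influenced by the Backup Methodology

| | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 | Day 6 | Day 7 | Total, days 1-3 | Total, days 1-7 |
|---|---|---|---|---|---|---|---|---|---|
| Changed files (bytes out) | 40 GB | 1 GB | 2 GB | 3 GB | 1 GB | 2 GB | 3 GB | 43 GB | 52 GB |
| Daily Full (bytes in) | 100 GB | 100 GB | 100 GB | 100 GB | 100 GB | 100 GB | 100 GB | 300 GB | 700 GB |
| Daily Full (ratio) | | | | | | | | 6.9 x | 13.5 x |
| Weekly Full + Daily Cum. Incr. (bytes in) | 100 GB | 10 GB | 20 GB | 30 GB | 40 GB | 40 GB | 50 GB | 130 GB | 290 GB |
| Weekly Full + Daily Cum. Incr. (ratio) | | | | | | | | 3.0 x | 5.6 x |
| Weekly Full + Daily Diff. Incr. (bytes in) | 100 GB | 10 GB | 10 GB | 10 GB | 10 GB | 20 GB | 20 GB | 120 GB | 180 GB |
| Weekly Full + Daily Diff. Incr. (ratio) | | | | | | | | 2.8 x | 3.5 x |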
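The ratios on this slide can be recomputed from the daily figures as cumulative bytes in divided by cumulative bytes out. The sketch below does that for the 3-day and 7-day points; the variable and function names are hypothetical, and the daily values are copied from the slide. Note that the 3-day Daily Full figure computes to 300 / 43, which rounds to 7.0, where the slide shows 6.9 x.

```python
# Daily figures in GB, as given on the slide.
BYTES_OUT = [40, 1, 2, 3, 1, 2, 3]
BYTES_IN = {
    "Daily Full": [100, 100, 100, 100, 100, 100, 100],
    "Weekly Full + Daily Cum. Incr.": [100, 10, 20, 30, 40, 40, 50],
    "Weekly Full + Daily Diff. Incr.": [100, 10, 10, 10, 10, 20, 20],
}


def ratio_after(days, bytes_in, bytes_out=BYTES_OUT):
    """Cumulative bytes in over cumulative bytes out for the first `days` days."""
    return sum(bytes_in[:days]) / sum(bytes_out[:days])


for method, series in BYTES_IN.items():
    three = ratio_after(3, series)
    seven = ratio_after(7, series)
    print(f"{method}: {three:.1f}x after 3 days, {seven:.1f}x after 7 days")
```

The same stored data yields very different ratios because the methodology changes only the numerator: repeated full backups send far more redundant bytes in than incrementals do.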

Q&A / Feedback
Please send any questions or comments on this presentation to SNIA: trackdatamgmt@snia.org
Many thanks to the following individuals for their contributions to this tutorial.
- SNIA Education Committee
Data Deduplication and Space Reduction SIG:
- Rory Bolt
- Matthew Brisse
- Mike Dutch
- Michael Fishman
- Larry Freeman
- Devin Hamilton
- Jason Iehl
- Shane Jackson
- Jeff Porter
- Gideon Senderov
- Jim Shocrylas