Sparse Indexing: Large-Scale, Inline Deduplication Using Sampling and Locality


Sparse Indexing: Large-Scale, Inline Deduplication Using Sampling and Locality. Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, and Peter Camble. Work done at Hewlett-Packard Laboratories. 2009 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

The Problem: deduplication at scale for disk-to-disk backup

A Disk-to-Disk Backup Scenario: a D2D server streaming data (fake tape library). Each tape: up to 400 GB; total: 100 TB to 10 PB. February 26, 2009

Example backup streams. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.

Little changes from day to day: the same backup streams, with mostly the same files recurring each day.

After ideal deduplication: the same backup streams, with each duplicate file stored only once.

Chunk-based deduplication: the same backup streams, with files divided into chunks so that duplicates can be found and stored once at chunk granularity.
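Chunk boundaries are typically content-defined so that an insertion does not shift every later boundary. A minimal sketch, using a toy rolling hash rather than the paper's actual chunker (constants are illustrative; the 4 KB mean chunk size matches the simulator parameter quoted later):

```python
import hashlib

# Toy content-defined chunker (illustrative, not the paper's implementation).
# A boundary is declared where the rolling hash's low-order bits are zero,
# giving a mean chunk size of roughly 2**MASK_BITS bytes.
MASK_BITS = 12                       # ~4 KB mean chunk size
MASK = (1 << MASK_BITS) - 1
MIN_CHUNK, MAX_CHUNK = 1024, 65536   # bound chunk sizes

def chunks(data: bytes):
    """Yield content-defined chunks of `data`."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF   # toy rolling hash
        length = i + 1 - start
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):                    # trailing partial chunk
        yield data[start:]

def chunk_id(chunk: bytes) -> str:
    """Chunks are identified by a cryptographic hash of their contents."""
    return hashlib.sha1(chunk).hexdigest()
```

Because boundaries depend only on nearby content, two streams sharing a long run of bytes produce identical chunks for that run, which is what makes chunk-level duplicate detection work.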

The standard implementation: an index in RAM maps each chunk hash (H(A), H(B), H(C)) to that chunk's location in an on-disk chunk store. At scale the index no longer fits in RAM, producing the chunk-lookup disk bottleneck.
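The full-index design can be sketched as follows (the class and method names are illustrative, not from the paper). The bottleneck arises because at 4 KB chunks and 100 TB of data the index has tens of billions of entries, so in practice it spills out of RAM and each lookup costs a disk seek:

```python
import hashlib

# Sketch of the "standard implementation": one index entry per stored chunk.
class ChunkStore:
    def __init__(self):
        self.index = {}   # H(chunk) -> location; held in RAM in this design
        self.log = []     # stands in for the on-disk chunk store

    def write(self, chunk: bytes) -> int:
        """Store a chunk, returning its location; duplicates are not re-stored."""
        h = hashlib.sha1(chunk).digest()
        if h in self.index:            # duplicate: keep only a reference
            return self.index[h]
        self.log.append(chunk)         # new chunk: append to the store
        self.index[h] = len(self.log) - 1
        return self.index[h]

store = ChunkStore()
a = store.write(b"chunk A")
store.write(b"chunk B")
assert store.write(b"chunk A") == a    # deduplicated: same location returned
assert len(store.log) == 2             # only two unique chunks stored
```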

One existing solution: "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," Benjamin Zhu (Data Domain, Inc.), Kai Li (Data Domain, Inc. and Princeton University), and Hugo Patterson (Data Domain, Inc.), FAST '08. Today: a new approach that uses significantly less RAM and provides a guaranteed minimum throughput.

Our Approach: Sparse indexing

Sparse indexing. Key ideas: chunk locality and sampling.

No temporal locality: in the same backup streams, duplicate chunks do not recur close together in time, so caching recently seen chunk hashes does not help.

Large sections of data reappear mostly intact: chunk locality.

Exploiting chunk locality: the incoming stream (file B, file D ... file X, file D) is matched against previously stored runs of chunks.

Divide into segments: segments 2, 3, 4, 5 spanning file B, file D, new data, file X, file D. (Chunks not shown; real segments are much longer.)

Deduplicate one segment at a time, against a few carefully chosen champion segments.

Champion #1: the most similar segment.

Champion #2: the segment most similar to the remainder.

Finding similar segments by sampling. Sparse index: maps samples to the segment(s) containing them.
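A sketch of sampling and champion choice, assuming a chunk hash is a sample when its low-order bits are zero (the sampling rate, cap of one segment ID per sample, and function names here are illustrative):

```python
from collections import Counter, defaultdict

SAMPLE_BITS = 6   # samples roughly 1 in 64 chunk hashes (illustrative rate)

def is_sample(chunk_hash: int) -> bool:
    return chunk_hash & ((1 << SAMPLE_BITS) - 1) == 0

# Sparse index: sample -> IDs of segments containing it (capped per sample).
sparse_index = defaultdict(list)

def index_segment(seg_id, chunk_hashes, max_ids_per_sample=1):
    for h in chunk_hashes:
        if is_sample(h) and len(sparse_index[h]) < max_ids_per_sample:
            sparse_index[h].append(seg_id)

def choose_champions(chunk_hashes, max_champions=10):
    """Greedily pick champions: first the segment sharing the most samples,
    then the segment most similar to the remainder, and so on."""
    samples = {h for h in chunk_hashes if is_sample(h)}
    champions = []
    while samples and len(champions) < max_champions:
        votes = Counter()
        for h in samples:
            for seg_id in sparse_index.get(h, ()):
                votes[seg_id] += 1
        if not votes:
            break
        best, _ = votes.most_common(1)[0]
        champions.append(best)
        # Drop samples already covered, so the next pick targets the remainder.
        samples = {h for h in samples if best not in sparse_index.get(h, ())}
    return champions
```

Only the sampled hashes are ever indexed, which is what keeps the sparse index small enough to live in RAM.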

A few details. We also keep segment recipes: lists of pointers to a segment's chunks. We actually deduplicate against champion recipes. Results are better with variable-sized segments whose boundaries are based on landmarks ("superchunks"); this reduces the number of champions required.
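Deduplicating against champion recipes can be sketched as follows, treating a recipe simply as the ordered list of a segment's chunk hashes (the real recipes hold pointers to stored chunk data; this structure is illustrative):

```python
# Sketch: deduplicate one incoming segment against loaded champion recipes.
def deduplicate_segment(chunk_hashes, champion_recipes):
    """Return (new_hashes, recipe): the hashes whose chunk data must be
    stored, plus the recipe recording every chunk of the incoming segment."""
    seen = set()
    for recipe in champion_recipes:      # champion recipes loaded from disk
        seen.update(recipe)
    new_hashes = []
    for h in chunk_hashes:
        if h not in seen:                # not in any champion, not seen yet
            new_hashes.append(h)
            seen.add(h)
    return new_hashes, list(chunk_hashes)
```

Note that only chunks absent from every chosen champion are written; a duplicate that happens to live in some unchosen segment is stored again, which is the deliberate trade of a little deduplication quality for bounded RAM and I/O.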

Putting it all together: the byte stream enters the Chunker, which emits chunks; the Segmenter groups chunks into segments; the champion chooser takes each segment's samples, looks up segment IDs in the sparse index, and returns champion IDs; the Deduplicator, given the champion recipes, emits sparse-index updates, new recipes (to the segment recipes on disk), and new chunks (compressed, to the chunk data files).

Results

Methodology: built a simulator. Fixed parameters: 4 KB mean chunk size; variable-size segments; maximum of 1 segment ID kept per sample. Varying parameters: mean segment size; sampling rate; maximum number of champions per segment (M).

The data sets. Workgroup [this talk]: 3.8 TB of backups of 20 desktop PCs belonging to engineers; semi-regular backups over 3 months via tar; 154 full backups and 392 incremental backups; end-of-week full backups are synthetic. SMB [see paper]: backups of a server with real Oracle data and synthetic Microsoft Exchange data; two weeks; 0.6 TB.

Chunk locality exists [chart].

Sampling can exploit most of it (10 MB mean segment size) [chart].

Deduplication with at most 10 champions [chart].

Deduplication depends primarily on ... [chart].

Index RAM usage at sampling rates 1/128, 1/64, and 1/32 [chart].

Comparison with Zhu, et al. Their chunk lookup uses a Bloom filter (might the store have a copy?), a cache of chunk container indexes, and a full on-disk index. When chunk locality is poor, their deduplication quality remains constant but throughput degrades. They find all duplicate chunks, but use a larger chunk size.

RAM usage comparison: sampling rates 1/128, 1/64, 1/32 versus chunk sizes 4 KB, 8 KB, 16 KB, 32 KB [chart].

What about all those disk accesses? They are infrequent, thanks to batch processing. Example: load at most 10 champions per 10 MB segment; the measured average is 1.7 champions per 10 MB segment = 0.17 champions/MB, or about 1 seek per 5 MB. I/O burden: 20 ms to load a champion recipe (~100 KB), so 1 drive can handle a > 250 MB/s ingestion rate.
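The slide's arithmetic, worked through (all numbers are taken from the slide itself):

```python
# Back-of-envelope I/O arithmetic from the slide.
champions_per_segment = 1.7   # measured average (at most 10 allowed)
segment_mb = 10               # 10 MB mean segment size
seek_ms = 20                  # time to load one ~100 KB champion recipe

champions_per_mb = champions_per_segment / segment_mb
assert abs(champions_per_mb - 0.17) < 1e-9     # 0.17 champions/MB

mb_per_seek = 1 / champions_per_mb             # ~5.9 MB ingested per seek
io_ms_per_mb = champions_per_mb * seek_ms      # 3.4 ms of I/O per MB
ingest_rate = 1000 / io_ms_per_mb              # MB/s one drive can sustain
print(round(mb_per_seek, 1), round(ingest_rate))   # → 5.9 294
```

So a single drive's recipe-loading budget supports roughly 294 MB/s of ingestion, comfortably above the slide's claimed 250 MB/s floor.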

Thank You