A DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH. Wen Xia, Hong Jiang, Dan Feng, Lei Tian, Min Fu, Yukun Zhou. Presented by Roman Shor.

Overview Techniques of data reduction in storage systems: traditional compression (Huffman coding and dictionary coding, e.g. GZIP); data deduplication (eliminates redundancy at the chunk/file level); delta compression (removes redundancy among non-duplicate but very similar data files and chunks). 2

Spot the difference (figure: two nearly identical images, A and A'; the small difference between them is the delta). 3

Delta compression (diagram): encoding takes the source and the target and produces a delta (Source + Target -> Δ); decoding applies the delta to the source to reconstruct the target (Δ + Source -> Target); a reverse delta works in the opposite direction, reconstructing the source from the target. 4

Outline Background and motivation Design and implementation Performance evaluation Conclusions 5

Motivation
              Delta compression    Deduplication
Target        Similar data         Duplicate data
Granularity   String               Chunk/File
Scalability   Weak                 Strong
Delta compression can eliminate more redundancy among non-duplicate but similar chunks (about 2-3X more). Uses of delta compression: Dropbox reduces bandwidth requirements by sending only the delta updates; I-CASH saves space and enlarges the logical space of SSD caches; Difference Engine saves memory through sub-page-level sharing. 6

Delta algorithms. Insert/delete delta algorithms use a longest common subsequence (LCS) algorithm to compute an edit script that transforms the source version into the target version. Copy/insert delta algorithms locate matching offsets in the source and target, then emit a sequence of copy instructions for each matching range and insert instructions to cover the unmatched regions. Example: Source = "proxy cache", Target = "cache proxy". Insert/delete: insert("cache "), retain(0,5), delete(5,6). Copy/insert: copy(6,5), insert(" "), copy(0,5). 7
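To make the copy/insert semantics concrete, here is a minimal C sketch (my own illustration, not the paper's encoding format) that applies a list of Copy/Insert instructions to the source above and rebuilds the target; the delta_op struct and apply_delta() are hypothetical names.

    /* Applying a copy/insert delta: each instruction either copies a range
     * from the source or inserts literal bytes carried in the delta itself. */
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        int is_copy;          /* 1 = Copy, 0 = Insert              */
        size_t offset;        /* Copy: offset into the source      */
        size_t length;        /* length of the copied/inserted run */
        const char *literal;  /* Insert: literal bytes to emit     */
    } delta_op;

    static size_t apply_delta(const char *src, const delta_op *ops,
                              size_t nops, char *out)
    {
        size_t pos = 0;
        for (size_t i = 0; i < nops; i++) {
            const char *from = ops[i].is_copy ? src + ops[i].offset
                                              : ops[i].literal;
            memcpy(out + pos, from, ops[i].length);
            pos += ops[i].length;
        }
        return pos;
    }

    int main(void)
    {
        const char *source = "proxy cache";
        /* copy(6,5) -> "cache", insert(" "), copy(0,5) -> "proxy" */
        delta_op ops[] = {
            { 1, 6, 5, NULL },
            { 0, 0, 1, " "  },
            { 1, 0, 5, NULL },
        };
        char target[32];
        size_t n = apply_delta(source, ops, 3, target);
        target[n] = '\0';
        printf("%s\n", target);   /* prints "cache proxy" */
        return 0;
    }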

Copy/insert delta, building the index (figure): each string of the source is fingerprinted and stored in a hash-to-offset index (e.g. hash 9dc6 at offset 8, hash b9b7 at offset 6). 8

Copy/insert delta, searching for matches (figure): the target is scanned and its string hashes are looked up in the source index; a hit (e.g. hash 9dc6, source offset 8) becomes a copy instruction. 9
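Below is a minimal sketch of this index-and-match core, assuming fixed-size blocks and an FNV-1a stand-in hash for brevity (Ddelta itself uses content-defined strings and Spooky hashing); weak_hash(), the toy bucket table, and the example strings are all illustrative.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLK 4                 /* toy block size */
    #define BUCKETS 256           /* toy index size */

    typedef struct { uint32_t hash; long offset; } entry;

    static uint32_t weak_hash(const unsigned char *p, size_t n)
    {
        uint32_t h = 2166136261u;                 /* FNV-1a, a stand-in hash */
        for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 16777619u; }
        return h;
    }

    int main(void)
    {
        const unsigned char *src = (const unsigned char *)"the proxy cache!";
        const unsigned char *tgt = (const unsigned char *)"the proxy stash!";
        entry idx[BUCKETS] = {{0, 0}};
        size_t i, src_len = 16, tgt_len = 16;

        /* Building the index: hash -> offset for every source block. */
        for (i = 0; i + BLK <= src_len; i += BLK) {
            uint32_t h = weak_hash(src + i, BLK);
            idx[h % BUCKETS] = (entry){ h, (long)i };
        }
        /* Searching for matches: target blocks that hit the index (and pass
         * a byte-wise check) become Copy instructions; the rest Inserts. */
        for (i = 0; i + BLK <= tgt_len; i += BLK) {
            uint32_t h = weak_hash(tgt + i, BLK);
            entry e = idx[h % BUCKETS];
            if (e.hash == h && memcmp(src + e.offset, tgt + i, BLK) == 0)
                printf("Copy(src_off=%ld, len=%d)\n", e.offset, BLK);
            else
                printf("Insert(tgt_off=%zu, len=%d)\n", i, BLK);
        }
        return 0;
    }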

Challenges: locating duplicate and similar data chunks, and computing the differences among similar data chunks. A byte-wise sliding window for identifying matched strings is very time-consuming: the average delta encoding speed of similar chunks falls in the range of 30-90 MB/s. 10

Outline Background and motivation Design and implementation Performance evaluation Conclusions 11

Approach. Content-Defined Chunking (CDC): divide the base and input chunks into smaller, independent strings and then detect duplicates among these strings. Locality of redundant data: regions immediately adjacent to confirmed duplicate strings are likely to contain duplicate content as well. 12

Design. Gear-based chunking: fast chunking using a lookup table. Spooky fingerprinting: duplicate identification among strings. Greedy byte-wise scanning: searching the areas adjacent to duplicate strings in the hope of finding more redundancy. Encoding: duplicate and non-duplicate strings are encoded as Copy and Insert instructions, respectively. 13

Gear-based chunking.
Gear hash: H_i = (H_{i-1} << 1) + GearTable[B_i]
GearTable is an array of 256 random 32-bit integers, indexed by the byte value B_i. Total: 1 ADD, 1 SHIFT, 1 ARRAY LOOKUP.
Rabin hash: H_i = ((H_{i-1} ^ U[B_{i-n}]) << 8 | B_i) ^ T[H_{i-1} >> N]
U and T denote predefined arrays for finite-field multiplication (B_i is the i-th input byte; n is the sliding-window size). Total: 1 OR, 2 XORs, 2 SHIFTs, 2 ARRAY LOOKUPs. 14

Gear-based chunking, worked example (decimal toy: the hash keeps three digits, and a left shift drops the leading digit and appends a zero):
H_i     = (H_{i-1} << 1) + GearTable[2] =   0 + 184 = 184
H_{i+1} = (H_i << 1)     + GearTable[7] = 840 + 537 = 377
H_{i+2} = (H_{i+1} << 1) + GearTable[5] = 770 + 204 = 974
H_{i+3} = (H_{i+2} << 1) + GearTable[9] = 740 + 519 = 259
H_{i+4} = (H_{i+3} << 1) + GearTable[2] = 590 + 184 = 774
GearTable: [0]=6, [1]=512, [2]=184, [3]=174, [4]=342, [5]=204, [6]=679, [7]=537, [8]=925, [9]=519. 15
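The following is a minimal, self-contained C sketch of Gear-based content-defined chunking under stated assumptions: the GearTable is filled with rand() here only to keep the sketch runnable, the boundary test masks the most significant bits of the hash, and the min/max/mask parameters are illustrative rather than the paper's settings.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MASK 0xFC000000u      /* test the 6 most-significant bits:
                                     expected string size around 64 bytes */

    static uint32_t gear_table[256];

    /* Returns the length of the next string starting at data[0]. */
    static size_t gear_chunk(const unsigned char *data, size_t n,
                             size_t min_len, size_t max_len)
    {
        uint32_t h = 0;
        size_t i;
        for (i = 0; i < n && i < max_len; i++) {
            h = (h << 1) + gear_table[data[i]];   /* 1 SHIFT, 1 ADD, 1 LOOKUP */
            if (i + 1 >= min_len && (h & MASK) == 0)
                return i + 1;                     /* content-defined boundary */
        }
        return i;                                 /* hit max length or end    */
    }

    int main(void)
    {
        unsigned char buf[4096];
        size_t off = 0, n = sizeof buf;

        for (int i = 0; i < 256; i++) gear_table[i] = (uint32_t)rand();
        for (size_t i = 0; i < n; i++) buf[i] = (unsigned char)rand();

        while (off < n) {
            size_t len = gear_chunk(buf + off, n - off, 16, 256);
            printf("string at offset %zu, length %zu\n", off, len);
            off += len;
        }
        return 0;
    }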

Spooky fingerprinting: a 64-bit Spooky hash is used instead of the time-consuming SHA-1, and candidate duplicates are confirmed by comparing content byte-wise (memcmp() in C), with negligible overhead relative to chunking and fingerprinting. Other fast hash functions such as MurmurHash and xxHash could also be employed. 16
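A minimal sketch of fingerprint-then-verify duplicate identification; fp64() is an FNV-1a stand-in for a fast 64-bit hash such as SpookyHash or xxHash, and the linear "index" is only for illustration (a real implementation would use a hash table).

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { uint64_t fp; const unsigned char *data; size_t len; } str_ref;

    static uint64_t fp64(const unsigned char *p, size_t n)
    {
        uint64_t h = 1469598103934665603ull;      /* FNV-1a 64-bit, a stand-in */
        for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ull; }
        return h;
    }

    /* Returns 1 if 'candidate' duplicates one of the indexed base strings. */
    static int is_duplicate(const str_ref *base, size_t nbase,
                            const unsigned char *candidate, size_t len)
    {
        uint64_t fp = fp64(candidate, len);
        for (size_t i = 0; i < nbase; i++)
            if (base[i].fp == fp && base[i].len == len &&
                memcmp(base[i].data, candidate, len) == 0)  /* byte-wise check */
                return 1;
        return 0;
    }

    int main(void)
    {
        const unsigned char *s = (const unsigned char *)"hello world";
        str_ref base[1];
        base[0].fp = fp64(s, 11); base[0].data = s; base[0].len = 11;
        printf("%d\n", is_duplicate(base, 1, (const unsigned char *)"hello world", 11)); /* 1 */
        printf("%d\n", is_duplicate(base, 1, (const unsigned char *)"hello werld", 11)); /* 0 */
        return 0;
    }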

Greedy byte-wise scanning: a CDC-based approach cannot accurately find the boundary between changed and duplicate regions, so Ddelta exploits data-stream content locality: chunk-level search on resemblance-detected chunks, and string-level (byte-wise) search in the duplicate-adjacent areas. 17
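A minimal sketch of the greedy byte-wise extension step, assuming a confirmed duplicate string described by (source offset, target offset, length); extend_match() is a hypothetical helper, not the paper's code.

    #include <stddef.h>
    #include <stdio.h>

    typedef struct { size_t src_off, tgt_off, len; } match;

    static match extend_match(const unsigned char *src, size_t src_len,
                              const unsigned char *tgt, size_t tgt_len,
                              match m)
    {
        /* scan backwards from the start of the duplicate string */
        while (m.src_off > 0 && m.tgt_off > 0 &&
               src[m.src_off - 1] == tgt[m.tgt_off - 1]) {
            m.src_off--; m.tgt_off--; m.len++;
        }
        /* scan forwards past the end of the duplicate string */
        while (m.src_off + m.len < src_len && m.tgt_off + m.len < tgt_len &&
               src[m.src_off + m.len] == tgt[m.tgt_off + m.len]) {
            m.len++;
        }
        return m;
    }

    int main(void)
    {
        const unsigned char *src = (const unsigned char *)"abcdefgh";
        const unsigned char *tgt = (const unsigned char *)"abcdefgX";
        match m = extend_match(src, 8, tgt, 8, (match){2, 2, 3}); /* "cde" */
        printf("src_off=%zu tgt_off=%zu len=%zu\n", m.src_off, m.tgt_off, m.len);
        /* the match grows to src_off=0 tgt_off=0 len=7 ("abcdefg") */
        return 0;
    }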

Ddelta workflow. Step 1: scanning from both ends. Step 2: identifying duplicate strings. 18

Ddelta workflow. Step 3: scanning the areas adjacent to duplicates. Step 4: encoding the delta chunk (C = Copy, I = Insert). 19
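A minimal sketch of step 4: given the matches found in the target (sorted by target offset), the gaps become Insert instructions and the matched regions become Copy instructions referencing the base chunk; encode_delta() and the printed format are illustrative only.

    #include <stdio.h>
    #include <stddef.h>

    typedef struct { size_t src_off, tgt_off, len; } match;

    static void encode_delta(const match *matches, size_t nmatches, size_t tgt_len)
    {
        size_t pos = 0;
        for (size_t i = 0; i < nmatches; i++) {
            if (matches[i].tgt_off > pos)               /* literal gap      */
                printf("I off=%zu len=%zu\n", pos, matches[i].tgt_off - pos);
            printf("C src_off=%zu len=%zu\n",           /* duplicate region */
                   matches[i].src_off, matches[i].len);
            pos = matches[i].tgt_off + matches[i].len;
        }
        if (pos < tgt_len)                              /* trailing literal */
            printf("I off=%zu len=%zu\n", pos, tgt_len - pos);
    }

    int main(void)
    {
        match m[] = { {0, 0, 5}, {10, 8, 6} };
        encode_delta(m, 2, 20);
        /* prints: C src_off=0 len=5; I off=5 len=3;
                   C src_off=10 len=6; I off=14 len=6 */
        return 0;
    }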

Post-deduplication workflow: system overview (figure). 20

Similarity detection: compute fingerprints over the chunk/file and select the N smallest values (6 fingerprints per chunk). Combine them into super-fingerprints (2 super-fingerprints of 3 fingerprints each). Search the index for a matching super-fingerprint, using either a BestFit or a FirstFit strategy (FirstFit in our case). 21
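A minimal sketch of the super-fingerprint computation under assumptions: the 6 fingerprints are taken here as the 6 smallest hashes of fixed-size windows (the actual sampling rule may differ), FNV-1a stands in for the real hash, and every 3 fingerprints are folded into one super-fingerprint; chunks sharing a super-fingerprint would be treated as similar.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NFP   6
    #define GROUP 3

    static uint64_t h64(const void *p, size_t n)
    {
        const unsigned char *b = p;
        uint64_t h = 1469598103934665603ull;      /* FNV-1a as a stand-in hash */
        for (size_t i = 0; i < n; i++) { h ^= b[i]; h *= 1099511628211ull; }
        return h;
    }

    static void super_fingerprints(const unsigned char *chunk, size_t len,
                                   uint64_t sf[NFP / GROUP])
    {
        uint64_t smallest[NFP];
        size_t win = 32, i, j;

        for (i = 0; i < NFP; i++) smallest[i] = UINT64_MAX;
        for (i = 0; i + win <= len; i += win) {          /* fingerprint windows */
            uint64_t fp = h64(chunk + i, win);
            for (j = 0; j < NFP; j++)                    /* keep the N smallest */
                if (fp < smallest[j]) {
                    memmove(&smallest[j + 1], &smallest[j],
                            (NFP - j - 1) * sizeof smallest[0]);
                    smallest[j] = fp;
                    break;
                }
        }
        for (i = 0; i < NFP / GROUP; i++)                /* 2 super-fingerprints */
            sf[i] = h64(&smallest[i * GROUP], GROUP * sizeof smallest[0]);
    }

    int main(void)
    {
        unsigned char chunk[8192];
        uint64_t sf[NFP / GROUP];
        for (size_t i = 0; i < sizeof chunk; i++) chunk[i] = (unsigned char)(i * 131);
        super_fingerprints(chunk, sizeof chunk, sf);
        printf("%016llx %016llx\n",
               (unsigned long long)sf[0], (unsigned long long)sf[1]);
        return 0;
    }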

Outline Background and motivation Design and implementation Performance evaluation Conclusions 22

Evaluation datasets. GCC and Linux: workloads of typical large software source code. VM-A: VM images of different OS release versions, low dedup factor. VM-B: 177 backups of an Ubuntu 12.04 VM in daily use, a common real-world data reduction use case. RDB: 211 backups of a Redis key-value store database, a typical database workload for data reduction. Bench: generated from snapshots of a personal cloud storage benchmark. 23

Experimental setup. Data deduplication chunk sizes: average 8 KB, maximum 64 KB, and minimum 2 KB. Xdelta [1] and Zdelta [2] are used as the delta compression baselines. Metrics: compression ratio (CR), the percentage of data reduced, and compression factor (CF), the ratio of data size before reduction to data size after reduction. Platform: Ubuntu 12.04.2, quad-core Intel i7 processor at 2.8 GHz, 16 GB RAM, two 1 TB 7200 RPM hard disks, 120 GB SSD. References: 1. J. MacDonald, File system support for delta compression, Master's thesis, Department of EECS, University of California, Berkeley, 2000. 2. D. Trendafilov, N. Memon, T. Suel, zdelta: An efficient delta compression tool, Technical report, Department of CS, Polytechnic University, 2002. 24
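Assuming the standard definitions consistent with the slide's wording, the two metrics can be written as:

    \mathrm{CR} = \left(1 - \frac{\text{size after reduction}}{\text{size before reduction}}\right)\times 100\%,
    \qquad
    \mathrm{CF} = \frac{\text{size before reduction}}{\text{size after reduction}}

For example, reducing 100 GB of data to 25 GB gives CR = 75% and CF = 4.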

Gear hash evaluation Hash function distribution 25

Gear hash evaluation Chunk-size distribution on RDB 26

Gear hash evaluation Chunking speed Compression performance 27

Ddelta evaluation. A post-deduplication data reduction system that applies delta and GZ compression to the non-duplicate chunks. Case study I: delta compression of resemblance-detected similar chunks. Case study II: delta compression of updated tarred files. 28

Ddelta evaluation (figures). CR achieved by the three duplicate-identification steps of Ddelta on different workloads. CR as a function of the average string size on the Linux dataset. 29

Ddelta evaluation (figure). Encoding speed as a function of the average string size, Linux dataset. 30

Ddelta evaluation (figure). Evaluating combinations of chunking schemes with fingerprinting schemes. 31

Ddelta evaluation CR of post-deduplication data reduction schemes 32

Ddelta evaluation. Compression throughput. 33

Ddelta evaluation. Decompression throughput. 34

Ddelta evaluation (case study II). Delta compression performance on the updated similar tarred files. 35

Ddelta evaluation (case study II, figures). CR of Ddelta, Xdelta, and deduplication on the similar tarred datasets; encoding speed; decoding speed. 36

Outline Background and motivation Design and implementation Performance evaluation Conclusions 37

Conclusions. A delta compression scheme can be fast: encoding speedups of 2.5x-8x and decoding speedups of 2x-20x, achieved by applying deduplication principles without sacrificing compression ratio. Gear-based chunking speeds up the Rabin-based Content-Defined Chunking process by a factor of about 2.1x. 38

Ongoing questions. Similarity detection? (See DARE: A Deduplication-Aware Resemblance Detection and Elimination Scheme for Data Reduction with Low Overheads.) How will garbage collection manage delta-compressed files? Inline or offline delta compression? Write/read throughput? 39