A Deduplication-Inspired Fast Delta Compression Approach
Wen Xia, Hong Jiang, Dan Feng, Lei Tian, Min Fu, Yukun Zhou
Presented by Roman Shor
Overview
Techniques for data reduction in storage systems:
Traditional compression: Huffman coding and dictionary coding (e.g., GZIP)
Data deduplication: eliminates redundancy at the chunk/file level
Delta compression: removes redundancy among non-duplicate but very similar files and chunks
Spot the difference (figure: two nearly identical images and the delta between them)
Delta compression
Encoding: Source + Target produce a delta (Δ)
Decoding: Δ + Source reconstructs the Target; a reverse delta recovers the Source from the Target
Outline
Background and motivation
Design and implementation
Performance evaluation
Conclusions
Motivation
                Delta compression    Deduplication
Target          similar data         duplicate data
Granularity     string               chunk/file
Scalability     weak                 strong
Delta compression can eliminate more redundancy among non-duplicate but similar chunks (about 2-3x more than deduplication alone).
Uses of delta compression:
Dropbox reduces bandwidth requirements by sending only the delta updates.
I-CASH saves space and enlarges the logical space of SSD caches.
Difference Engine saves memory through sub-page-level sharing.
Delta algorithms
Insert/delete algorithms use a longest common subsequence (LCS) computation to produce an edit script that transforms the source version into the target version.
Copy/insert algorithms locate matching offsets in the source and target, then emit a Copy instruction for each matching range and Insert instructions to cover the unmatched regions.
Example (see the sketch after this slide): Source = "proxy cache", Target = "cache proxy"
Insert/delete: insert("cache "), retain(0,5), delete(5,6)
Copy/insert: copy(6,5), insert(" "), copy(0,5)
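As a concrete illustration of the copy/insert format, here is a minimal C sketch (not the paper's implementation) that applies the slide's instruction list to the source "proxy cache" and rebuilds the target "cache proxy". The Instr type and apply_delta helper are hypothetical names introduced only for this example.

```c
#include <stdio.h>
#include <string.h>

/* A copy/insert delta is a list of instructions that rebuild the target
 * from the source plus literal (inserted) bytes. */
typedef enum { OP_COPY, OP_INSERT } Op;
typedef struct {
    Op op;
    size_t offset;      /* OP_COPY: offset in the source             */
    size_t len;         /* number of bytes to copy / insert          */
    const char *data;   /* OP_INSERT: literal bytes, NULL for copies */
} Instr;

static size_t apply_delta(const char *src, const Instr *ins, size_t n,
                          char *out)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        const char *from = (ins[i].op == OP_COPY) ? src + ins[i].offset
                                                  : ins[i].data;
        memcpy(out + pos, from, ins[i].len);
        pos += ins[i].len;
    }
    return pos;
}

int main(void)
{
    const char *source = "proxy cache";          /* slide example  */
    Instr delta[] = {
        { OP_COPY,   6, 5, NULL },               /* "cache"        */
        { OP_INSERT, 0, 1, " "  },               /* " "            */
        { OP_COPY,   0, 5, NULL },               /* "proxy"        */
    };
    char target[32];
    size_t len = apply_delta(source, delta, 3, target);
    printf("%.*s\n", (int)len, target);          /* "cache proxy"  */
    return 0;
}
```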
Copy/insert delta: building an index
Each window of the source is hashed into a (hash, offset) table, e.g. 9dc6 -> offset 8, b9b7 -> offset 6.
Copy/insert delta: searching for matches
Windows of the target are hashed and looked up in the source index; a hit (e.g. 9dc6 -> offset 8) identifies a candidate matching region in the source.
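A minimal C sketch of the two steps above, assuming a fixed window size and FNV-1a as a stand-in weak hash (real tools use their own rolling hashes): build_index records a (hash, offset) entry per source window, and find_match looks a target window up and verifies the bytes to rule out collisions. All names here are illustrative.

```c
#include <stdint.h>
#include <string.h>

#define WIN     8          /* window size, chosen only for illustration */
#define BUCKETS 4096       /* assumed large enough for the source below */

/* one open-addressing slot: weak hash of a source window -> its offset */
typedef struct { uint32_t hash; size_t offset; int used; } Slot;

/* placeholder weak hash over one window (FNV-1a, an assumption) */
static uint32_t win_hash(const uint8_t *p)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < WIN; i++) h = (h ^ p[i]) * 16777619u;
    return h;
}

/* Step 1: index every non-overlapping window of the source */
static void build_index(const uint8_t *src, size_t n, Slot *tab)
{
    for (size_t off = 0; off + WIN <= n; off += WIN) {
        uint32_t h = win_hash(src + off);
        size_t b = h % BUCKETS;
        while (tab[b].used) b = (b + 1) % BUCKETS;   /* linear probing */
        tab[b] = (Slot){ h, off, 1 };
    }
}

/* Step 2: look a target window up in the index; verify byte-wise */
static int find_match(const uint8_t *src, const uint8_t *tgt_win,
                      const Slot *tab, size_t *src_off)
{
    uint32_t h = win_hash(tgt_win);
    for (size_t b = h % BUCKETS; tab[b].used; b = (b + 1) % BUCKETS) {
        if (tab[b].hash == h &&
            memcmp(src + tab[b].offset, tgt_win, WIN) == 0) {
            *src_off = tab[b].offset;
            return 1;
        }
    }
    return 0;
}
```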
Challenges
Locating duplicate and similar data chunks and computing the differences among similar chunks is expensive.
A byte-wise sliding window for identifying matched strings is very time-consuming.
As a result, the average delta encoding speed for similar chunks falls in the range of 30-90 MB/s.
Outline
Background and motivation
Design and implementation
Performance evaluation
Conclusions
Approach
Content-Defined Chunking (CDC): divide the base and input chunks into smaller, independent strings and then detect duplicates among these strings.
Locality of redundant data: regions immediately adjacent to confirmed duplicate strings are likely to contain duplicate content as well.
Design
Gear-based chunking: fast chunking using a lookup table.
Spooky fingerprinting: duplicate identification among strings.
Greedy byte-wise scanning: search the areas adjacent to duplicate strings to find additional redundancy.
Encoding: encode duplicate and non-duplicate strings as Copy and Insert instructions, respectively.
Gear-based chunking
Gear hash: H_i = (H_{i-1} << 1) + GearTable[B_i]
GearTable is an array of 256 random 32-bit integers.
Total per byte: 1 ADD, 1 SHIFT, 1 ARRAY LOOKUP.
Rabin hash (rolling form): H_i = ((H_{i-1} ^ U[B_{i-n}]) << 8 | B_i) ^ T[H_{i-1} >> N]
U and T are predefined arrays for the finite-field (polynomial) arithmetic.
Total per byte: 1 OR, 2 XORs, 2 SHIFTs, 2 ARRAY LOOKUPs.
Gear-based chunking: worked example
(Toy example: a 3-decimal-digit register stands in for the 32-bit hash, so a left shift multiplies by 10 and the overflowing leading digit is discarded, just as bits shifted out of a 32-bit register are discarded.)
H_i     = (H_{i-1} << 1) + GearTable[2] =   0 + 184 = 184
H_{i+1} = (H_i     << 1) + GearTable[7] = 840 + 537 = 377
H_{i+2} = (H_{i+1} << 1) + GearTable[5] = 770 + 204 = 974
H_{i+3} = (H_{i+2} << 1) + GearTable[9] = 740 + 519 = 259
H_{i+4} = (H_{i+3} << 1) + GearTable[2] = 590 + 184 = 774
GearTable: [0]=6, [1]=512, [2]=184, [3]=174, [4]=342, [5]=204, [6]=679, [7]=537, [8]=925, [9]=519
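A minimal C sketch of Gear-based string chunking under the formula above. The boundary test ((h & mask) == 0) and the mask value are assumptions chosen for illustration, and GearTable is assumed to be filled with 256 random 32-bit values at start-up.

```c
#include <stdint.h>
#include <stddef.h>

/* 256 random 32-bit integers, filled once at start-up (values omitted here). */
extern uint32_t GearTable[256];

/* Declare a string boundary when the low bits of the rolling hash are zero;
 * the mask controls the expected average string size (0x1F -> ~32 B). */
#define GEAR_MASK 0x0000001Fu

/* Split a chunk into content-defined strings using the Gear rolling hash:
 * H_i = (H_{i-1} << 1) + GearTable[B_i]  -- one shift, one add, one lookup. */
static size_t next_string_len(const uint8_t *p, size_t n)
{
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++) {
        h = (h << 1) + GearTable[p[i]];
        if ((h & GEAR_MASK) == 0)
            return i + 1;        /* boundary found after byte i */
    }
    return n;                    /* no boundary: the rest of the chunk */
}
```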
Spooky fingerprinting
A 64-bit Spooky hash is used instead of the time-consuming SHA-1.
Collisions are resolved by comparing content byte-wise (memcmp() in C); this overhead is negligible relative to chunking and fingerprinting.
Other fast hashes, such as MurmurHash and xxHash, could also be employed.
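A small sketch of the duplicate-identification test described above, with hash64() as a placeholder for whichever fast 64-bit hash is plugged in (Spooky, Murmur, xxHash); the prototype is not any specific library's API. The memcmp() verification is the byte-wise check mentioned on the slide.

```c
#include <stdint.h>
#include <string.h>

/* Placeholder prototype: any fast 64-bit hash can be used here. */
extern uint64_t hash64(const void *data, size_t len);

/* Two strings are duplicates only if their weak fingerprints match AND the
 * bytes really match; memcmp() rules out hash collisions cheaply. */
static int is_duplicate(const uint8_t *a, size_t alen,
                        const uint8_t *b, size_t blen)
{
    return alen == blen &&
           hash64(a, alen) == hash64(b, blen) &&
           memcmp(a, b, alen) == 0;
}
```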
Greedy byte-wise scanning
The CDC-based approach cannot accurately find the boundary between changed and duplicate regions.
Exploit data-stream content locality:
chunk-level search for resemblance-detected chunks;
string-level search in the areas adjacent to duplicate strings.
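A possible C sketch of the string-level greedy scan: starting from a confirmed duplicate string, the match is grown one byte at a time in both directions, exploiting the locality of redundancy around duplicates. The Match type and extend_match name are illustrative, not the paper's code.

```c
#include <stdint.h>
#include <stddef.h>

/* A matched region: offset in the base chunk, offset in the input chunk, length. */
typedef struct { size_t boff, ioff, len; } Match;

/* Given a confirmed duplicate string base[boff..boff+len) == input[ioff..ioff+len),
 * greedily extend the match backwards and forwards byte by byte. */
static Match extend_match(const uint8_t *base, size_t bsize,
                          const uint8_t *input, size_t isize,
                          size_t boff, size_t ioff, size_t len)
{
    /* scan backwards from the left edge of the duplicate string */
    while (boff > 0 && ioff > 0 && base[boff - 1] == input[ioff - 1]) {
        boff--; ioff--; len++;
    }
    /* scan forwards from the right edge */
    while (boff + len < bsize && ioff + len < isize &&
           base[boff + len] == input[ioff + len]) {
        len++;
    }
    return (Match){ boff, ioff, len };
}
```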
Ddelta workflow (figure)
Step 1: scanning from both ends.
Step 2: identifying duplicate strings.
Ddelta workflow, continued (figure)
Step 3: scanning the areas adjacent to duplicates.
Step 4: encoding the delta chunk (C = Copy, I = Insert), as sketched below.
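A rough sketch of step 4, assuming the earlier steps produced a list of byte ranges of the input chunk that were matched against the base chunk (the illustrative Match type from the greedy-scan sketch is reused): matched ranges become Copy instructions and the gaps between them become Insert instructions.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { size_t boff, ioff, len; } Match;  /* from the scanning steps */

/* Walk the input chunk: each matched range -> Copy into the base chunk,
 * each gap between matches -> Insert of literal bytes. Matches are assumed
 * sorted by ioff and non-overlapping. */
static void emit_delta(const uint8_t *input, size_t isize,
                       const Match *m, size_t nmatches)
{
    (void)input;                                   /* literals omitted in this sketch */
    size_t pos = 0;
    for (size_t i = 0; i < nmatches; i++) {
        if (m[i].ioff > pos)                       /* unmatched gap */
            printf("I(len=%zu)\n", m[i].ioff - pos);
        printf("C(base_off=%zu, len=%zu)\n", m[i].boff, m[i].len);
        pos = m[i].ioff + m[i].len;
    }
    if (pos < isize)                               /* trailing literal */
        printf("I(len=%zu)\n", isize - pos);
}
```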
Post-deduplication workflow: system overview (figure)
Similarity detection
Compute fingerprints over the chunk/file and select the N smallest values (6 fingerprints per chunk).
Combine them into super-fingerprints (2 super-fingerprints of 3 fingerprints each).
Search the index for a matching super-fingerprint, using either a BestFit or a FirstFit strategy (FirstFit in our case). A sketch follows.
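A hedged sketch of super-fingerprint computation along the lines described above: six per-chunk features are taken as minima of seeded rolling fingerprints over all windows of the chunk, and every three features are folded into one super-fingerprint; two chunks sharing any super-fingerprint are treated as similar. rolling_fp and hash64 are placeholder prototypes, not a specific library's API, and the per-window recomputation is kept naive for clarity.

```c
#include <stdint.h>
#include <stddef.h>

#define FEATURES 6          /* per-chunk features, as on the slide      */
#define PER_SF   3          /* features combined into one super-feature */
#define SUPER    (FEATURES / PER_SF)

/* Placeholder prototypes for a seeded window fingerprint and a 64-bit hash. */
extern uint64_t rolling_fp(const uint8_t *win, size_t len, uint64_t seed);
extern uint64_t hash64(const void *data, size_t len);

/* For each of the 6 seeds, keep the minimum fingerprint over all sliding
 * windows of the chunk, then fold every 3 features into one super-fingerprint. */
static void super_features(const uint8_t *chunk, size_t n, size_t win,
                           uint64_t sf[SUPER])
{
    uint64_t feat[FEATURES];
    for (int f = 0; f < FEATURES; f++) feat[f] = UINT64_MAX;

    for (size_t off = 0; off + win <= n; off++)
        for (int f = 0; f < FEATURES; f++) {
            uint64_t v = rolling_fp(chunk + off, win, (uint64_t)f + 1);
            if (v < feat[f]) feat[f] = v;
        }

    for (int s = 0; s < SUPER; s++)
        sf[s] = hash64(&feat[s * PER_SF], PER_SF * sizeof(uint64_t));
}
```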
Outline
Background and motivation
Design and implementation
Performance evaluation
Conclusions
Evaluation datasets
GCC and Linux: typical workloads of large software source code.
VM-A: VM images of different OS release versions; low dedup factor.
VM-B: 177 backups of an Ubuntu 12.04 VM in use; a common real-world data-reduction use case.
RDB: 211 backups of a Redis key-value store database; a typical database workload for data reduction.
Bench: generated from snapshots of a personal cloud-storage benchmark.
Experimental setup
Data deduplication chunk sizes: average 8 KB, maximum 64 KB, minimum 2 KB.
Xdelta [1] and Zdelta [2] are used as the delta compression baselines.
Metrics:
Compression ratio (CR): percentage of data reduced.
Compression factor (CF): ratio of the data size before reduction to the size after reduction.
Platform: Ubuntu 12.04.2; quad-core Intel i7 at 2.8 GHz; 16 GB RAM; 2 x 1 TB 7200 RPM hard disks; 120 GB SSD.
[1] J. MacDonald. File system support for delta compression. Master's thesis, Department of EECS, University of California, Berkeley, 2000.
[2] D. Trendafilov, N. Memon, T. Suel. zdelta: An efficient delta compression tool. Technical report, Department of CS, Polytechnic University, 2002.
Gear hash evaluation: hash value distribution (figure)
Gear hash evaluation: chunk-size distribution on the RDB dataset (figure)
Gear hash evaluation: chunking speed and compression performance (figures)
Ddelta evaluation
A post-deduplication data reduction system that applies delta and GZ compression to the non-duplicate chunks.
Case study I: delta compression of resemblance-detected similar chunks.
Case study II: delta compression of updated tarred files.
Ddelta evaluation
CR contributed by the three duplicate-identification steps of Ddelta on different workloads (figure).
CR as a function of the average string size on the Linux dataset (figure).
Ddelta evaluation
Encoding speed as a function of the average string size on the Linux dataset (figure).
Evaluating combinations of chunking schemes with fingerprinting schemes (figure).
Ddelta evaluation: CR of post-deduplication data reduction schemes (figure)
Ddelta evaluation: compression throughput (figure)
Ddelta evaluation: decompression throughput (figure)
Ddelta evaluation, case study II: delta compression performance on the updated similar tarred files (figure)
Ddelta evaluation, case study II: CR of Ddelta, Xdelta, and deduplication on the similar tarred datasets; encoding speed; decoding speed (figures)
Outline
Background and motivation
Design and implementation
Performance evaluation
Conclusions
Conclusions
A delta compression scheme can be fast:
encoding speedup of 2.5x to 8x;
decoding speedup of 2x to 20x;
achieved by applying deduplication principles without sacrificing the compression ratio.
Gear-based chunking speeds up the Rabin-based Content-Defined Chunking process by a factor of about 2.1x.
Ongoing questions
Similarity detection? See DARE: A Deduplication-Aware Resemblance Detection and Elimination Scheme for Data Reduction with Low Overheads.
How will garbage collection manage delta-compressed files?
Inline or offline delta compression?
Impact on write/read throughput?