Sparse Indexing: Large-Scale, Inline Deduplication Using Sampling and Locality
Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, and Peter Camble
Work done at Hewlett-Packard Laboratories
© 2009 Hewlett-Packard Development Company, L.P.
The Problem: deduplication at scale for disk-to-disk backup
A Disk-to-Disk Backup Scenario: a D2D server presents streaming data as a fake (virtual) tape library. Each tape: up to 400 GB; total capacity: 100 TB to 10 PB. (February 26, 2009)
Example backup streams. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
Little changes from day to day. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
After ideal deduplication. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
Chunk-based deduplication. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
The standard implementation: an index in RAM maps chunk hashes (H(A), H(B), H(C)) to the chunks (A, B, C) in an on-disk chunk store. The problem: the chunk-lookup disk bottleneck.
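Back-of-the-envelope arithmetic shows why the standard implementation hits this bottleneck at the 100 TB scale above (assuming 4 KB mean chunks, 20-byte SHA-1 keys, and an assumed 8-byte pointer per index entry):

```python
store_size = 100 * 2**40      # 100 TB of stored data
mean_chunk = 4 * 2**10        # 4 KB mean chunk size
entry_size = 20 + 8           # SHA-1 digest + assumed 8-byte chunk pointer

n_chunks = store_size // mean_chunk   # about 26.8 billion chunks
index_bytes = n_chunks * entry_size   # about 700 GB of index
```

An index of roughly 700 GB cannot be held in RAM, and chunk hashes have no temporal locality, so caching fails and nearly every lookup costs a disk seek.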
One existing solution: "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System", Benjamin Zhu (Data Domain, Inc.), Kai Li (Data Domain, Inc. and Princeton University), and Hugo Patterson (Data Domain, Inc.), FAST '08. Today: a new approach that uses significantly less RAM and provides a guaranteed minimum throughput.
Our Approach: Sparse indexing
Sparse indexing. Key ideas: chunk locality and sampling.
No temporal locality. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
Large sections of data reappear mostly intact: this is chunk locality. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
Exploiting chunk locality: file B, file D, file X, file D.
Divide into segments (2, 3, 4, 5): file B, file D, new file X, file D. Chunks not shown; real segments are much longer.
Deduplicate one segment at a time, against a few carefully chosen champion segments.
Champion #1: the most similar segment.
Champion #2: the segment most similar to the remainder.
Finding similar segments by sampling. The sparse index maps each sample to the segment(s) containing it.
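The sampling scheme on this slide can be sketched as follows: a chunk hash is a sample ("hook") when its low-order bits are zero, the sparse index maps each hook to the segments containing it, and champions for an incoming segment are chosen greedily by how many not-yet-covered hooks they share with it. The sampling rate and data structures here are illustrative assumptions, not the paper's exact parameters.

```python
from collections import defaultdict

SAMPLE_BITS = 5   # assumed sampling rate of 1/32: low 5 bits of the hash are zero

def hooks(segment_hashes):
    """A segment's samples: the chunk hashes whose low-order bits are zero."""
    return {h for h in segment_hashes if h % (1 << SAMPLE_BITS) == 0}

class SparseIndex:
    def __init__(self):
        self.by_hook = defaultdict(list)   # hook -> IDs of segments containing it

    def add_segment(self, seg_id, segment_hashes):
        for h in hooks(segment_hashes):
            self.by_hook[h].append(seg_id)

    def choose_champions(self, segment_hashes, max_champions=10):
        """Greedily pick stored segments covering the most hooks of the
        incoming segment (champion #1 = most similar, champion #2 = most
        similar to the remainder, and so on)."""
        remaining = hooks(segment_hashes)
        champions = []
        while remaining and len(champions) < max_champions:
            votes = defaultdict(int)
            for h in remaining:
                for seg_id in self.by_hook.get(h, ()):
                    votes[seg_id] += 1
            if not votes:
                break                      # no stored segment shares a hook
            best = max(votes, key=votes.get)
            champions.append(best)
            remaining -= {h for h in remaining
                          if best in self.by_hook.get(h, ())}
        return champions
```

Only the small sparse index lives in RAM; the incoming segment is then deduplicated against the chosen champions' recipes, loaded from disk.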
A few details: we also keep segment recipes (lists of pointers to a segment's chunks) and actually deduplicate against champion recipes. Variable-sized segments, with boundaries based on landmarks ("superchunks"), work better: they reduce the number of champions required.
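The per-segment deduplication step against champion recipes can be sketched as follows; the recipe representation here (a map from chunk hash to on-disk address) is a hypothetical simplification:

```python
def dedupe_segment(segment_hashes, champion_recipes):
    """Build the incoming segment's recipe: reuse a pointer from any
    champion recipe that already holds the chunk, else mark it new."""
    known = {}
    for recipe in champion_recipes:        # recipe: {chunk_hash: disk_address}
        known.update(recipe)
    out_recipe, new_chunks = [], []
    for h in segment_hashes:
        if h in known:
            out_recipe.append(known[h])    # duplicate: point at existing copy
        else:
            new_chunks.append(h)           # must be stored (compressed)
            out_recipe.append(("new", h))
    return out_recipe, new_chunks
```

Chunks missed because they appear in no champion are simply stored again; this loses a little deduplication but never correctness.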
Putting it all together: byte stream → Chunker → chunks → Segmenter → segments → Champion chooser, which sends samples to the sparse index, receives segment IDs, and emits champion IDs → Deduplicator, which reads champion recipes and writes sparse-index updates, new recipes, and new (compressed) chunks. Disk storage holds the segment recipes and the chunk data files.
Results
Methodology: built a simulator. Fixed parameters: 4 KB mean chunk size; variable-size segments; at most 1 segment ID kept per sample. Varied parameters: mean segment size; sampling rate; maximum number of champions per segment (M).
The data sets. Workgroup [this talk]: 3.8 TB; backups of 20 desktop PCs belonging to engineers; semi-regular backups over 3 months via tar; 154 full and 392 incremental backups; end-of-week full backups are synthetic. SMB [see paper]: 0.6 TB; backups of a server with real Oracle data and synthetic Microsoft Exchange data over two weeks.
Chunk locality exists
Sampling can exploit most of it (10 MB mean segment size)
Deduplication with at most 10 champions
Deduplication depends primarily on
Index RAM usage (graph; sampling rates 1/128, 1/64, 1/32)
Comparison with Zhu et al. Their chunk lookup uses a Bloom filter (might the store have a copy?), a cache of chunk-container indexes, and a full on-disk index. When chunk locality is poor, their deduplication quality remains constant but throughput degrades. They find all duplicate chunks, but use a larger chunk size.
RAM usage comparison (graph; sampling rates 1/128, 1/64, 1/32 vs. chunk sizes 4 KB, 8 KB, 16 KB, 32 KB)
What about all those disk accesses? They are infrequent, thanks to batch processing. Example: we load at most 10 champions per 10 MB segment, and observe an average of 1.7 champions per 10 MB segment = 0.17 champions/MB, i.e., roughly 1 seek per 5 MB. I/O burden: 20 ms to load a champion recipe (~100 KB), so 1 drive can handle a > 250 MB/s ingestion rate.
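The slide's arithmetic can be checked directly from its own numbers:

```python
champions_per_segment = 1.7    # observed average (at most 10 allowed)
segment_mb = 10                # 10 MB mean segment size
load_seconds = 0.020           # 20 ms to seek and read one ~100 KB recipe

recipes_per_mb = champions_per_segment / segment_mb   # 0.17 champions/MB
io_seconds_per_mb = recipes_per_mb * load_seconds     # 3.4 ms of I/O per MB
ingest_mb_per_s = 1 / io_seconds_per_mb               # about 294 MB/s
```

So a single drive's recipe-loading burden supports an ingestion rate above 250 MB/s, as claimed.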
Thank You