Sparse Indexing: Large-Scale, Inline Deduplication Using Sampling and Locality
Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, and Peter Camble
Work done at Hewlett-Packard Laboratories
© 2009 Hewlett-Packard Development Company, L.P.
The Problem: deduplication at scale for disk-to-disk backup
A Disk-to-Disk Backup Scenario: a D2D server presents streaming data as a fake (virtual) tape library. Each tape: up to 400 GB; total capacity: 100 TB to 10 PB. (February 26, 2009)
Example backup streams. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
Little changes from day to day. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
After ideal deduplication. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
Chunk-based deduplication. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
The standard implementation: an index in RAM maps chunk hashes (H(A), H(B), H(C)) to the chunks (A, B, C) in an on-disk chunk store. The problem: the chunk-lookup disk bottleneck.
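Back-of-the-envelope arithmetic shows why the standard implementation hits this bottleneck at the 100 TB scale above (assuming 4 KB mean chunks, 20-byte SHA-1 keys, and an assumed 8-byte pointer per index entry):

```python
store_size = 100 * 2**40      # 100 TB of stored data
mean_chunk = 4 * 2**10        # 4 KB mean chunk size
entry_size = 20 + 8           # SHA-1 digest + assumed 8-byte chunk pointer

n_chunks = store_size // mean_chunk   # about 26.8 billion chunks
index_bytes = n_chunks * entry_size   # about 700 GB of index
```

An index of roughly 700 GB cannot be held in RAM, and chunk hashes have no temporal locality, so caching fails and nearly every lookup costs a disk seek.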
One existing solution: "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System", Benjamin Zhu (Data Domain, Inc.), Kai Li (Data Domain, Inc. and Princeton University), and Hugo Patterson (Data Domain, Inc.), FAST '08. Today: a new approach that uses significantly less RAM and provides a guaranteed minimum throughput.
Our Approach: Sparse indexing
Sparse indexing. Key ideas: chunk locality and sampling.
No temporal locality. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
Large sections of data reappear mostly intact: this is chunk locality. Monday: file B, file D, file E. Tuesday: file D, file E. Wednesday: file X, file D, file E.
Exploiting chunk locality: file B, file D, file X, file D.
Divide into segments (2, 3, 4, 5): file B, file D, new file X, file D. Chunks not shown; real segments are much longer.
Deduplicate one segment at a time, against a few carefully chosen champion segments.
Champion #1: the most similar segment.
Champion #2: the segment most similar to the remainder.
Finding similar segments by sampling. The sparse index maps each sample to the segment(s) containing it.
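The sampling scheme on this slide can be sketched as follows: a chunk hash is a sample ("hook") when its low-order bits are zero, the sparse index maps each hook to the segments containing it, and champions for an incoming segment are chosen greedily by how many not-yet-covered hooks they share with it. The sampling rate and data structures here are illustrative assumptions, not the paper's exact parameters.

```python
from collections import defaultdict

SAMPLE_BITS = 5   # assumed sampling rate of 1/32: low 5 bits of the hash are zero

def hooks(segment_hashes):
    """A segment's samples: the chunk hashes whose low-order bits are zero."""
    return {h for h in segment_hashes if h % (1 << SAMPLE_BITS) == 0}

class SparseIndex:
    def __init__(self):
        self.by_hook = defaultdict(list)   # hook -> IDs of segments containing it

    def add_segment(self, seg_id, segment_hashes):
        for h in hooks(segment_hashes):
            self.by_hook[h].append(seg_id)

    def choose_champions(self, segment_hashes, max_champions=10):
        """Greedily pick stored segments covering the most hooks of the
        incoming segment (champion #1 = most similar, champion #2 = most
        similar to the remainder, and so on)."""
        remaining = hooks(segment_hashes)
        champions = []
        while remaining and len(champions) < max_champions:
            votes = defaultdict(int)
            for h in remaining:
                for seg_id in self.by_hook.get(h, ()):
                    votes[seg_id] += 1
            if not votes:
                break                      # no stored segment shares a hook
            best = max(votes, key=votes.get)
            champions.append(best)
            remaining -= {h for h in remaining
                          if best in self.by_hook.get(h, ())}
        return champions
```

Only the small sparse index lives in RAM; the incoming segment is then deduplicated against the chosen champions' recipes, loaded from disk.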
A few details: we also keep segment recipes (lists of pointers to a segment's chunks) and actually deduplicate against champion recipes. Variable-sized segments, with boundaries based on landmarks ("superchunks"), work better: they reduce the number of champions required.
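The per-segment deduplication step against champion recipes can be sketched as follows; the recipe representation here (a map from chunk hash to on-disk address) is a hypothetical simplification:

```python
def dedupe_segment(segment_hashes, champion_recipes):
    """Build the incoming segment's recipe: reuse a pointer from any
    champion recipe that already holds the chunk, else mark it new."""
    known = {}
    for recipe in champion_recipes:        # recipe: {chunk_hash: disk_address}
        known.update(recipe)
    out_recipe, new_chunks = [], []
    for h in segment_hashes:
        if h in known:
            out_recipe.append(known[h])    # duplicate: point at existing copy
        else:
            new_chunks.append(h)           # must be stored (compressed)
            out_recipe.append(("new", h))
    return out_recipe, new_chunks
```

Chunks missed because they appear in no champion are simply stored again; this loses a little deduplication but never correctness.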
Putting it all together: byte stream → Chunker → chunks → Segmenter → segments → Champion chooser, which sends samples to the sparse index, receives segment IDs, and emits champion IDs → Deduplicator, which reads champion recipes and writes sparse-index updates, new recipes, and new (compressed) chunks. Disk storage holds the segment recipes and the chunk data files.
Results
Methodology: built a simulator. Fixed parameters: 4 KB mean chunk size; variable-size segments; at most 1 segment ID kept per sample. Varied parameters: mean segment size; sampling rate; maximum number of champions per segment (M).
The data sets. Workgroup [this talk]: 3.8 TB; backups of 20 desktop PCs belonging to engineers; semi-regular backups over 3 months via tar; 154 full and 392 incremental backups; end-of-week full backups are synthetic. SMB [see paper]: 0.6 TB; backups of a server with real Oracle data and synthetic Microsoft Exchange data over two weeks.
Chunk locality exists
Sampling can exploit most of it (10 MB mean segment size)
Deduplication with at most 10 champions
Deduplication depends primarily on
Index RAM usage (graph; sampling rates 1/128, 1/64, 1/32)
Comparison with Zhu et al. Their chunk lookup uses a Bloom filter (might the store have a copy?), a cache of chunk-container indexes, and a full on-disk index. When chunk locality is poor, their deduplication quality remains constant but throughput degrades. They find all duplicate chunks, but use a larger chunk size.
RAM usage comparison (graph; sampling rates 1/128, 1/64, 1/32 vs. chunk sizes 4 KB, 8 KB, 16 KB, 32 KB)
What about all those disk accesses? They are infrequent, thanks to batch processing. Example: we load at most 10 champions per 10 MB segment, and observe an average of 1.7 champions per 10 MB segment = 0.17 champions/MB, i.e., roughly 1 seek per 5 MB. I/O burden: 20 ms to load a champion recipe (~100 KB), so 1 drive can handle a > 250 MB/s ingestion rate.
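The slide's arithmetic can be checked directly from its own numbers:

```python
champions_per_segment = 1.7    # observed average (at most 10 allowed)
segment_mb = 10                # 10 MB mean segment size
load_seconds = 0.020           # 20 ms to seek and read one ~100 KB recipe

recipes_per_mb = champions_per_segment / segment_mb   # 0.17 champions/MB
io_seconds_per_mb = recipes_per_mb * load_seconds     # 3.4 ms of I/O per MB
ingest_mb_per_s = 1 / io_seconds_per_mb               # about 294 MB/s
```

So a single drive's recipe-loading burden supports an ingestion rate above 250 MB/s, as claimed.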
Thank You