Deduplication Storage System
Kai Li
Charles Fitzmorris Professor, Princeton University & Chief Scientist and Co-Founder, Data Domain, Inc.
03/11/09
The World Is Becoming Data-Centric
- Business & personal life becoming digital: make better decisions
- Science: e-science (e.g. CERN Tier 0) causing a data explosion, accelerating scientific discovery
How Much Information? (IDC Estimates in 2007)
- Data size is on the way to zettabytes (10^9 TB)
- 161 exabytes (~10^8 TB) of information was created or replicated worldwide in 2006
- IDC estimates 6X growth by 2010, to 988 exabytes (~a zettabyte) per year
- New technical information doubles every 2 years
- Examples
  - Wal-Mart (US): 500 TB, 10^7 transactions/day in 2004
  - Google's BigTable (US): 1-2 petabytes
  - AT&T: 11 exabytes (~10^7 TB) of wireline, wireless, and Internet data
Challenges in a Digital World
- Find information
  - Search engines do well for text document data
  - Audio, images, video, and sensor data are much more difficult
  - Open problems in analysis, visualization, summarization
- Access data
  - Cloud computing and P2P address location and management issues
  - But $/bandwidth varies and improves slowly
- Store and protect data
  - Protect all versions of data frequently
  - Recover any version of data quickly
A Traditional Data Center
(Diagram: Clients -> Server -> Primary storage -> Mirrored storage; storage costs range from $$$ to $$$$)
A Data Center w/ Networked Disk Storage?
(Diagram: Clients -> Server -> Primary storage; onsite mirrored storage over dedicated fibre; a remote copy over WAN; ~20x primary capacity onsite and remote with 3-month retention)
- Costs
  - Purchase: $$$$
  - Bandwidth: $$$$$
  - Power: $$$
Trends for Bytes/$
- Storage increases exponentially; the gap between adjacent classes is about 3-5X
- Bandwidth is becoming flat
  - Low-end WAN: slow improvements
  - High-end WAN: almost no improvement
(Chart: Mbytes/$ vs. year, 1998-2008, for tape storage, ATA storage, FC storage, flash storage, and WAN bandwidth)
Replication with WAN Is Expensive
- A T3 line example: T3 is 45 Mbit/s, or ~486 GB/day; T3 cost is about $72k/year

  Policy                                    Dataset size   Cost
  Daily full backups                        ~500 GB        ~$300/GB over 2 years
  Daily incremental & weekly full backups   ~3 TB          ~$48/GB over 2 years

- The problem
  - Bandwidth costs 4x-20x data center primary storage
  - WAN Mbytes/$ increases slowly (well below the data growth rate)
  - The situation will get worse
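The figures above follow from simple arithmetic; a quick sanity check, using only the line rate and prices stated on the slide:

```python
# Back-of-the-envelope check of the T3 replication example above.
T3_BITS_PER_SEC = 45_000_000      # T3 line rate: 45 Mbit/s
SECONDS_PER_DAY = 86_400
COST_2_YEARS = 2 * 72_000         # ~$72k/year for the line

gb_per_day = T3_BITS_PER_SEC * SECONDS_PER_DAY / 8 / 1e9
print(f"T3 capacity: ~{gb_per_day:.0f} GB/day")                  # ~486 GB/day

# Cost per GB of protected data over 2 years, for each policy
print(f"daily fulls (500 GB): ~${COST_2_YEARS / 500:.0f}/GB")    # ~$288/GB, i.e. ~$300/GB
print(f"incr + weekly (3 TB): ~${COST_2_YEARS / 3000:.0f}/GB")   # $48/GB
```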
A Data Center with Deduplication Storage
(Diagram: Clients -> Server -> Primary storage, with onsite deduplicated mirrored storage and a remote copy over WAN)
- Promises
  - Purchase cost: ~tape libraries
  - Space: >10X reduction
  - WAN bandwidth: >10X reduction
  - Power: >10X reduction
High-Level Idea of Deduplication
- Traditional local compression: encode within a sliding window of bytes (e.g. 100 KB); ~2x compression
- Deduplication: a much larger window captures more redundant data; ~10-50x compression
Backup Data Example
- View from backup software (tar or similar format): first full backup, incrementals 1 and 2, and second full backup produce the data stream
  A B C D A E F G A B H A E I B J C D E F G H
- Deduplicated storage: redundancies pooled, unique segments compressed
  A B C D E F G H I J
- Legend: unique variable-size segments; redundant data segments; compressed unique segments
Two Deduplication Approaches
- Deltas
  - Compute a sketch for each segment [Broder97]
  - Find a similar segment by comparing sketches
  - Compute deltas against the most similar segment
  - But: needs to read segments from disks
- Fingerprinting
  - Compute a fingerprint as the ID for each segment
  - Use an index to look up whether the segment is a duplicate
  - Efficiency depends on index lookups
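A minimal sketch of the fingerprinting approach; the names and data structures here are illustrative, not Data Domain's implementation:

```python
import hashlib

# Illustrative fingerprint index: fingerprint -> location of the stored segment.
index = {}
store = []

def write_segment(data: bytes) -> int:
    """Store a segment only if its fingerprint is new; return its location."""
    fp = hashlib.sha1(data).digest()   # the fingerprint serves as the segment ID
    if fp not in index:                # index lookup decides duplicate vs. new
        store.append(data)
        index[fp] = len(store) - 1
    return index[fp]

# Writing the same segment twice stores its data only once.
loc1 = write_segment(b"segment A")
loc2 = write_segment(b"segment A")
assert loc1 == loc2 and len(store) == 1
```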
Segmentation
- Two approaches
  - Fixed-size: simple but cannot handle shifts well (A X C D vs. A Y C D misalign after an insertion)
  - Content-based variable-size: independent of shifts; roll a fingerprint over the stream and cut a boundary whenever fp = xxxx0000
- Segment size trade-offs
  - Smaller -> higher compression
  - Smaller -> more memory, less throughput
  - Example: 100 GB of index per 20 TB of data at 4 KB segment size
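A toy version of content-based variable-size segmentation; the rolling hash and boundary mask below are simplifications of a real Rabin fingerprint:

```python
def chunk(data: bytes, mask: int = 0x0F, min_size: int = 4) -> list:
    """Cut a boundary wherever the low bits of a rolling hash are all zero,
    so boundaries depend on content, not on byte offsets."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * 31 + byte) & 0xFFFFFFFF      # toy rolling hash (not Rabin)
        if i + 1 - start >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])  # boundary: low hash bits are zero
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])           # final partial chunk
    return chunks

pieces = chunk(b"the quick brown fox jumps over the lazy dog")
assert b"".join(pieces) == b"the quick brown fox jumps over the lazy dog"
```

Because boundaries are chosen by content, inserting bytes early in a stream shifts only the chunks near the insertion; later chunks re-synchronize and still deduplicate.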
Main Components
- Interfaces (NFS, CIFS, VTL, ...)
- Object-oriented file system
- Deduplication
- GC & verification
- Data layout
- Replication
- RAID (e.g. RAID-6)
- Disk shelves
Design Challenges
- Very reliable and self-healing
  - Backup data is the last stop
  - Multi-dimensional verifications and self-healing
- High throughput + high compression at low cost
  - Why high throughput: a day has only 24 hours
  - Why high compression: make disks cost like tapes
  - Controller cost should be small
Typical Alternatives
- Cache the index?
  - The file buffer cache has a low hit rate (<80%): fingerprints are random
- Parallelize the index?
  - 7200 RPM Seagate 1 TB drive: <120 seeks/sec
  - 120 lookups/s at 8 KB segment size: 0.96 MB/sec/disk
  - Need ~200 disks to achieve 200 MB/s! [Venti02]
- Buffer the data?
  - Needs a very large disk buffer
  - Long delay to move data offsite; a day still has 24 hours
High Throughput, High Compression at Low Cost
- Main ideas
  - Lay out data on disk with duplicate locality
  - Use a sophisticated cache for the fingerprint index
    - A summary data structure for new data
    - Locality-preserved caching for old data
- See the FAST paper for details:
  Benjamin Zhu, Kai Li and Hugo Patterson. Avoiding the Disk Bottleneck in a Deduplication Storage System. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST '08), February 2008.
Summary Vector
- Goal: use minimal memory to test for new data
- Summarize which segments have been stored with a Bloom filter [Bloom70] in RAM
- If the Summary Vector says no, it is a new segment
- Insert: set bits h1(fp(s_i)), h2(fp(s_i)), h3(fp(s_i)) in the bit vector
- Lookup: read the same bits; any zero bit means "definitely new", all ones means "maybe stored"
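A minimal Summary Vector sketch; the bit-vector size and the salted-SHA-1 hashing scheme are arbitrary choices for illustration:

```python
import hashlib

class SummaryVector:
    """Bloom filter over stored fingerprints: a 'no' answer is definite,
    a 'yes' answer only means 'maybe stored' (false positives possible)."""

    def __init__(self, m_bits: int = 1 << 16, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, fp: bytes):
        # Derive k bit positions by hashing the fingerprint with k salts.
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + fp).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, fp: bytes) -> None:
        for p in self._positions(fp):
            self.bits[p >> 3] |= 1 << (p & 7)

    def maybe_contains(self, fp: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(fp))

sv = SummaryVector()
sv.add(b"fingerprint-1")
assert sv.maybe_contains(b"fingerprint-1")       # stored keys always answer 'maybe'
assert not sv.maybe_contains(b"fingerprint-2")   # nearly empty filter: 'new' here
```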
Known Analysis Results
- Bloom filter with m bits and k independent hash functions
- After inserting n keys, the probability of a false positive is
  p = (1 - (1 - 1/m)^{kn})^k ≈ (1 - e^{-kn/m})^k
- Examples
  - m/n = 6, k = 4: p = 0.0561
  - m/n = 8, k = 6: p = 0.0215
- Experimental data validate the analysis results
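The example numbers can be checked directly against the approximation:

```python
import math

def bloom_false_positive(m_over_n: float, k: int) -> float:
    """Approximate Bloom filter false-positive rate: p = (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k / m_over_n)) ** k

print(f"m/n=6, k=4: p = {bloom_false_positive(6, 4):.4f}")  # 0.0561
print(f"m/n=8, k=6: p = {bloom_false_positive(8, 6):.4f}")  # ~0.0215 as on the slide
```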
Stream-Informed Segment Layout (SISL)
- Goal: capture duplicate locality on disk
- Segments from the same stream are stored in the same containers
- Metadata (index data) are also stored in the containers
(Diagram: per-stream containers for streams 1-3, each with a metadata section and a data section)
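A sketch of the stream-informed layout idea; the container size and structure here are invented for illustration:

```python
class Container:
    """One container: a metadata section (fingerprints) plus a data section."""
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.metadata = []   # fingerprints of the segments held below
        self.data = []       # the segment data itself

def append_segment(open_containers: dict, stream_id: int, fp: bytes, seg: bytes):
    """Each stream fills its own container, so segments that arrive together
    (and will likely be looked up together) stay together on disk."""
    c = open_containers.setdefault(stream_id, Container())
    c.metadata.append(fp)
    c.data.append(seg)
    if len(c.data) >= c.capacity:
        return open_containers.pop(stream_id)   # sealed: ready to write to disk
    return None                                 # container still filling

open_containers = {}
sealed = [append_segment(open_containers, 1, bytes([i]), b"seg") for i in range(4)]
assert sealed[-1] is not None and len(sealed[-1].metadata) == 4
```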
Locality-Preserved Caching (LPC)
- Goal: maintain duplicate locality in the cache
- The Disk Index has all <fingerprint, containerID> pairs; the Index Cache caches a subset of such pairs
- On a miss, look up the Disk Index to find the containerID
- Load the metadata of that container into the Index Cache, replacing an old container's worth if needed
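A minimal LPC sketch under the assumption that container metadata fits in memory as fingerprint sets; the class and method names are illustrative:

```python
from collections import OrderedDict

class IndexCache:
    """Locality-preserved caching sketch: a miss loads the *entire* metadata
    of the matching container, prefetching its neighbor fingerprints."""

    def __init__(self, max_containers: int = 2):
        self.max_containers = max_containers
        self.cached = OrderedDict()   # container_id -> set of fingerprints

    def lookup(self, fp, disk_index, containers):
        for cid, fps in self.cached.items():
            if fp in fps:
                self.cached.move_to_end(cid)      # LRU touch
                return cid                        # cache hit: duplicate
        cid = disk_index.get(fp)                  # miss: consult the Disk Index
        if cid is None:
            return None                           # new segment
        self.cached[cid] = containers[cid]        # load whole container metadata
        if len(self.cached) > self.max_containers:
            self.cached.popitem(last=False)       # evict least-recently-used
        return cid

disk_index = {b"A": 0, b"B": 0}           # both fingerprints live in container 0
containers = {0: {b"A", b"B"}}
cache = IndexCache()
assert cache.lookup(b"A", disk_index, containers) == 0   # miss: loads container 0
assert cache.lookup(b"B", {}, containers) == 0           # hit on prefetched metadata
```

Passing an empty disk index on the second call shows that the neighbor fingerprint was served from the cache, with no disk lookup at all.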
Putting Them Together
- For each incoming fingerprint:
  1. Check the Summary Vector; a "no" means the segment is new
  2. On a "maybe", look up the Index Cache; a hit means duplicate
  3. On a cache miss, look up the Disk Index; on a hit, load that container's metadata into the Index Cache (with replacement)
(Diagram: fingerprint -> Summary Vector / Index Cache -> Disk Index -> container metadata and data)
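The combined path can be sketched as one function; for brevity a plain set stands in for the Bloom filter, and plain dicts for the cache, the Disk Index, and the containers:

```python
def classify(fp, summary, index_cache, disk_index, containers):
    """Decide duplicate vs. new for one fingerprint, cheapest check first."""
    if fp not in summary:                # 1. Summary Vector: definite 'new'
        summary.add(fp)
        return "new"
    if fp in index_cache:                # 2. Index Cache hit: no disk I/O
        return "duplicate"
    cid = disk_index.get(fp)             # 3. Disk Index: last resort
    if cid is None:
        return "new"                     # (Bloom-filter false-positive path)
    index_cache.update(containers[cid])  # LPC: prefetch the whole container
    return "duplicate"

summary, cache = set(), set()
disk_index = {b"A": 0, b"B": 0}
containers = {0: {b"A", b"B"}}
summary.update({b"A", b"B"})                  # both were stored earlier
assert classify(b"C", summary, cache, disk_index, containers) == "new"
assert classify(b"A", summary, cache, disk_index, containers) == "duplicate"
assert b"B" in cache                          # neighbor prefetched by LPC
```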
Evaluation
- What to evaluate
  - Disk I/O reduction: drive the evaluation with two real datasets; observe results at different parts of the system
  - Write and read throughput: a synthetic benchmark with multiple streams, mimicking multiple backups
  - Deduplication: report results from two data centers
- Platform: 2 quad-core 2.8 GHz Intel CPUs, 8 GB RAM, 10 GbE NIC, 1 GB NVRAM, 15 7,200 RPM ATA disks
Disk I/O Reduction Results

                             Exchange data (2.56 TB,    Engineering data (2.39 TB, 100 days
                             135 daily full backups)    of daily incr. & weekly full backups)
  Configuration              # disk I/Os   % of total   # disk I/Os   % of total
  No Summary, no SISL/LPC    328,613,503   100.00%      318,236,712   100.00%
  Summary only               274,364,788    83.49%      259,135,171    81.43%
  SISL/LPC only               57,725,844    17.57%       60,358,875    18.97%
  Summary & SISL/LPC           3,477,129     1.06%        1,257,316     0.40%
Write Throughput
(Chart: write throughput in MB/s vs. backup generations. Platform: 2x quad-core Xeon, 15 disks, 10 GbE)
Read Throughput
(Chart: read throughput in MB/s vs. backup generations. Platform: 2x quad-core Xeon, 15 disks, 10 GbE)
Real World Example at Datacenter A
Real World Compression at Datacenter A
Real World Example at Datacenter B
Real World Compression at Datacenter B
Related Work
- Local compression
  - Relatively small windows of bytes [Ziv&Lempel77, ...]
  - Larger windows [Bentley&McIlroy99]
- File-level deduplication systems
  - Use file hashes to detect duplicate files (CAS systems); do not address throughput issues
- Fixed-size block deduplication storage prototype [Venti02]
  - Fixed-size segments; does not address throughput issues
- Segment-level deduplication
  - Content-based segmentation methods [Manber93, Brin94]
  - Variable-size segment deduplication for network traffic [Spring00, LBFS01, TAPER05, ...]
  - Variable-size segment vs. delta deduplication [Kulkarni04]
- Use of Bloom filters
  - Summary data structures [Bloom70, Fan98, Broder02]
  - Duplicate detection [TAPER05]
Summary
- Deduplication storage replaces the tape library
  - Purchase cost: < tape library solution
  - Space reduction: 10-30x
  - Bandwidth reduction: 10-100x
  - Power reduction: >10x (~compression ratio)
- Scalable deduplication NFS throughput
  - ~110 MB/s on 2 2.6 GHz CPUs w/ 15 disks
  - ~210 MB/s on 2 2-core 3 GHz CPUs w/ 15 disks
  - ~360 MB/s on 2 4-core 2.8 GHz CPUs w/ 15 disks (~750 MB/s for the OST interface)
- Impact
  - Over 10,000 systems deployed to many data centers
  - >70% replicate data over WAN for disaster recovery
  - See white papers at http://www.datadomain.com
Beyond the Backup Use Case
- Nearline storage (already widely in use)
  - Besides throughput, optimize for file IOs/sec
  - 4-10X compression, depending on data
- Archiving (beginning)
  - Lock, shredding, encryption, ...
  - 4-10X compression, depending on data
- Future issues
  - Flash + dedup storage
  - New storage eco-system for data centers
  - Storage infrastructure for cloud computing