The Logic of Physical Garbage Collection in Deduplicating Storage

Fred Douglis, Abhinav Duggal, Philip Shilane, Tony Wong (Dell EMC)
Shiqin Yan (University of Chicago)
Fabiano Botelho (Rubrik)

2 Deduplication in Data Domain Filesystem (DDFS)
[Figure: File 1 and File 2 are split into variable-sized chunks (R S T W W X Y Z across the two files) and a fingerprint (fp) is generated for each chunk. Containers hold the chunks: C1 holds R and S, C2 holds T and W, C3 holds X and Y, C4 holds Z.]
Fingerprint index:
  fp   CID
  R    C1
  S    C1
  T    C2
  W    C2
  X    C3
  Y    C3
  Z    C4
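The write path in the figure above can be sketched as follows. This is a minimal illustration, not DDFS code: the class `DedupStore`, the two-chunk container capacity, the use of SHA-1, and the split of the chunk sequence into File 1 = R S T W and File 2 = W X Y Z are all assumptions made for the example, and variable-sized chunking is skipped by passing pre-split chunks.

```python
import hashlib

CONTAINER_CAPACITY = 2  # chunks per container; tiny on purpose (assumption)

class DedupStore:
    """Hypothetical deduplicating store: fingerprint index plus containers."""

    def __init__(self):
        self.index = {}          # fingerprint -> container ID (CID), 1-based
        self.containers = [[]]   # containers[i] holds (fingerprint, chunk) pairs

    def write_chunk(self, chunk: bytes) -> str:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp not in self.index:                 # new chunk: store it once
            if len(self.containers[-1]) == CONTAINER_CAPACITY:
                self.containers.append([])       # current container is full
            self.containers[-1].append((fp, chunk))
            self.index[fp] = len(self.containers)
        return fp                                # duplicates only return the fp

    def write_file(self, chunks):
        # A file is recorded as the list of fingerprints of its chunks.
        return [self.write_chunk(c) for c in chunks]

store = DedupStore()
file1 = store.write_file([b"R", b"S", b"T", b"W"])
file2 = store.write_file([b"W", b"X", b"Y", b"Z"])
```

With these inputs the second write of W is deduplicated, so seven chunks land in four containers and the index reproduces the fp-to-CID mapping shown on the slide.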

3 File Representation in DDFS
- Files are represented as a Merkle tree of fingerprints.
- Lp chunks (metadata) form the upper levels, L1 through L6.
- L0: chunks stored on disk in containers.
- COPY: fastcopy creates a new root into the same tree.
[Figure: two trees with upper levels L6 down to L2 above two L1 chunks. One L1 lists the fingerprints of R, S, T, U, V, W, X, Y; the other lists the fingerprints of R, S, Z.]

4 Deduplication Workloads on Data Domain
Traditional backups:
- Weekly full and daily incremental backups
- Full backups tend to be very large (100s of GB to TBs)
- Much content in full backups repeats the previous full
- Typically 10-20x total compression (TC); 20x TC = 10x dedup and 2x compression
New workloads:
- Synthetic full backups: send changes and a recipe to create a single full backup from some previous backup
- Daily fulls
- High TC (100x-400x or higher)
- High file count (100M to 1 billion small files)

5 Garbage Collection in a Deduplication Filesystem
- Duplicates are sometimes written to improve throughput.
[Figure: File 1, File 2 and File 3 reference chunks through the fingerprint index (fp to CID). Containers C1 to C5 hold the chunks: C1 holds R and S, C2 holds T and W, C3 holds X and Y, C4 holds Z, C5 holds Q and Y. Y is marked as a duplicate chunk; the figure also marks a shared chunk.]

6 Evolution of GC in DDFS
Logical GC (LGC):
- Depth-first traversal of the per-file Merkle tree on disk to mark live chunks in memory
- In-memory data structures may not allow the system to track all chunks, so an extra mark phase ("pre-phases") is used when necessary
Physical GC (PGC):
- Breadth-first traversal of the physical layout of Merkle trees to mark live chunks in memory
- Similar to LGC, pre-phases may be needed
Phase-optimized Physical GC (PGC+):
- Improvement over PGC by removing pre-phases, plus other optimizations

7 Logical GC Phases
- Merge: merge the in-memory index with the index on disk
- Enumeration: depth-first walk that marks live chunks in an in-memory Bloom filter called the live vector
- Filter: create the live instance vector (also a Bloom filter) from the live vector to remove duplicates
- Select: select the best containers to compact
- Copy: copy live chunks from the selected containers into new containers and delete the old containers
The slide labels these phases as a mark phase followed by a sweep phase.
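The mark and sweep steps above can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the DDFS implementation: `BloomFilter`, `enumerate_file`, and `select_and_copy` are hypothetical names, the Merge and Filter phases are omitted, containers here list only L0 fingerprints, and "select" is reduced to "any container with at least one dead chunk".

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=4096, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def enumerate_file(fp, children, live):
    """Mark: depth-first walk of one file tree, marking every chunk live."""
    live.add(fp)
    for child in children.get(fp, []):   # L0 chunks have no children
        enumerate_file(child, children, live)

def select_and_copy(containers, live, new_cid):
    """Sweep: copy live chunks out of containers that hold dead chunks."""
    result, copied = {}, []
    for cid, fps in containers.items():
        survivors = [fp for fp in fps if fp in live]
        if len(survivors) == len(fps):
            result[cid] = fps            # fully live: not selected
        else:
            copied.extend(survivors)     # selected: old container is dropped
    if copied:
        result[new_cid] = copied
    return result

# Two files sharing one L1 chunk; Z is referenced by no file.
children = {"F1": ["L1a"], "F2": ["L1a", "L1b"],
            "L1a": ["R", "S", "T"], "L1b": ["W", "X", "Y"]}
containers = {1: ["R", "S"], 2: ["T", "Z"], 3: ["W", "X", "Y"]}

live = BloomFilter()
for root in ("F1", "F2"):
    enumerate_file(root, children, live)
after = select_and_copy(containers, live, new_cid=4)
```

Because a Bloom filter can return false positives, a dead chunk is occasionally retained; that costs space but never loses data, since a live chunk is never reported dead.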

8 Enumeration Phase (Logical GC)
- Only Lp chunks are traversed.
[Figure: per-file trees (labeled F1) with levels L6, L2, L1 and L0; one L2 is marked as shared.]

9 Logical GC → Physical GC
Logical enumeration performance is sensitive to the following parameters:
- Total compression factor
- Number of small files
- Spatial locality of Lp chunks
Physical GC addresses these performance issues.

10 Physical GC (PGC)
- Uses a breadth-first walk instead of a per-file depth-first walk during enumeration
- Uses a Perfect Hash Vector (PHV) to store Lp chunks for assisting the breadth-first walk
  - Uses less memory
  - Needed for doing checksums to prevent corruption
- New analysis phase to build perfect hash functions for Lp chunks
- Remaining phases are the same as in logical GC
Vectors used:
  LGC: live vector and live instance vector (Bloom filters)
  PGC: walk vector (PHV); live vector and live instance vector (Bloom filters)

11 Collision-Free Perfect Hashing Vector (PHvec)
- A perfect hash function (PHF) is a collision-free hash function that maps each fingerprint to a unique position in a bit vector.
[Figure: the fingerprint set S = {s1, s2, ..., sn}, indexed 0 to n - 1, is mapped by the PHF to positions 0 to m - 1 of a bit vector.]
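To make the idea concrete, here is a small sketch of a perfect hash vector built with a simple bucket-and-seed search (a hash-and-displace style construction). The slides do not say which construction DDFS uses, so the algorithm, the class `PerfectHashVector`, and the parameters `slack`, `bucket_size`, and `max_seed` are all assumptions chosen for readability rather than for the memory figures quoted in the talk.

```python
import hashlib

def _h(seed: int, key: str) -> int:
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

class PerfectHashVector:
    """Bit vector addressed by a collision-free hash over a fixed key set."""

    def __init__(self, keys, slack=1.25, bucket_size=4, max_seed=100_000):
        keys = list(dict.fromkeys(keys))              # drop duplicate keys
        self.m = max(1, int(len(keys) * slack) + 1)   # bit positions, m >= n
        self.num_buckets = max(1, len(keys) // bucket_size)
        buckets = [[] for _ in range(self.num_buckets)]
        for k in keys:                                # seed 0 picks the bucket
            buckets[_h(0, k) % self.num_buckets].append(k)
        self.seeds = [1] * self.num_buckets
        taken = [False] * self.m
        # Place the largest buckets first, while most positions are free.
        for b in sorted(range(self.num_buckets), key=lambda i: -len(buckets[i])):
            for seed in range(1, max_seed):
                slots = [_h(seed, k) % self.m for k in buckets[b]]
                if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                    for s in slots:
                        taken[s] = True
                    self.seeds[b] = seed
                    break
            else:
                raise RuntimeError("no seed found; raise slack or max_seed")
        self.bits = bytearray((self.m + 7) // 8)

    def position(self, key: str) -> int:
        # Only meaningful for keys in the build set; others map arbitrarily.
        return _h(self.seeds[_h(0, key) % self.num_buckets], key) % self.m

    def set(self, key: str) -> None:
        p = self.position(key)
        self.bits[p // 8] |= 1 << (p % 8)

    def test(self, key: str) -> bool:
        p = self.position(key)
        return bool(self.bits[p // 8] & (1 << (p % 8)))

fingerprints = [f"fp{i}" for i in range(200)]
phv = PerfectHashVector(fingerprints)
phv.set("fp7")
```

Unlike a Bloom filter, setting the bit for one known fingerprint never makes another known fingerprint appear set. The trade-off is that the function must be built in advance over the full key set, which is the role of the analysis phase.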

12 Analysis Phase
On-disk container index:
  FP     CID   type
  fp1    10    L0
  fp2    5     Lp
  fp3    30    Lp
  ...
  fpn    40
[Figure: the index entries (#fps) are used to build in-memory perfect hash functions for the Lp chunks.]

13 Benefits & Costs of Physical Enumeration
Pro:
- Sequential scan of containers on disk: all L6, then all L5, down to the L1s
- Relatively few containers store high-level metadata
- No need to keep revisiting the same Lp containers due to fastcopy (high deduplication)
Con:
- The extra analysis cost doesn't help traditional workloads, and due to pre-phases we may have to run analysis twice!
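The difference between the two enumeration styles can be illustrated with a toy model. Everything here is a simplification made for the example: `physical_enumeration` and `logical_enumeration` are hypothetical names, plain Python sets stand in for the walk vector and live vector, each container holds chunks of a single level, only L1 chunks point at L0 chunks, and every Lp visit in the logical walk is counted as one read with no caching.

```python
def physical_enumeration(containers, roots, max_level=6):
    """Breadth-first: scan containers level by level, L6 down to L1."""
    walk = set(roots)    # Lp chunks known to be live (stand-in for walk vector)
    live = set(roots)    # all live chunks (stand-in for live vector)
    container_reads = 0
    for level in range(max_level, 0, -1):
        for c in containers:
            if c["level"] != level:
                continue
            container_reads += 1             # each container is scanned once
            for fp, children in c["chunks"].items():
                if fp not in walk:
                    continue                 # Lp chunk of a deleted file
                for child in children:
                    live.add(child)
                    if level > 1:            # assumption: only L1 points at L0
                        walk.add(child)
    return live, container_reads

def logical_enumeration(chunks, roots):
    """Depth-first per file: shared subtrees are walked once per file."""
    live, lp_reads = set(), 0

    def visit(fp):
        nonlocal lp_reads
        live.add(fp)
        if fp in chunks:                     # Lp chunk: read it to find children
            lp_reads += 1
            for child in chunks[fp]:
                visit(child)

    for root in roots:
        visit(root)
    return live, lp_reads

# F2 is a fastcopy of F1, so both roots point at the same L1 chunk A.
# B belongs to a deleted file, so its child Z is dead.
containers = [
    {"level": 6, "chunks": {"F1": ["A"], "F2": ["A"]}},
    {"level": 1, "chunks": {"A": ["R", "S"], "B": ["Z"]}},
]
all_chunks = {fp: kids for c in containers for fp, kids in c["chunks"].items()}

live_p, reads_p = physical_enumeration(containers, ["F1", "F2"])
live_l, reads_l = logical_enumeration(all_chunks, ["F1", "F2"])
```

Both walks mark the same chunks live, but the logical walk reads the shared L1 chunk once per file while the physical walk scans each container once regardless of how many files reference it.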

14 LGC and PGC phases (including pre-phases)
Logical GC (phases 1-4 are the pre-phases / sampling phases):
 1. Pre-merge
 2. Pre-enumeration
 3. Pre-filter
 4. Pre-select
 5. Candidate
 6. Enumeration
 7. Merge
 8. Filter
 9. Copy
10. Summary
Physical GC (phases 1-5 are the pre-phases / sampling phases):
 1. Pre-merge
 2. Pre-analysis
 3. Pre-enumeration
 4. Pre-filter
 5. Pre-select
 6. Merge
 7. Analysis
 8. Candidate
 9. Enumeration
10. Filter
11. Copy
12. Summary

15 Physical GC → Phase-optimized Physical GC
Limitations of Physical GC:
- Adds 2 extra phases (pre-analysis and analysis)
- Slightly degrades GC performance for customers with traditional backup workloads
Motivation for Phase-optimized Physical GC (PGC+):
- Avoid pre-phases by representing all chunks in memory
- Can we use a perfect hash as the live vector? It needs only 2.7 bits per fingerprint instead of 6 bits in a Bloom filter.
- Can we maintain the duplicate recipe without using a Bloom filter? That gets 50% of the memory back.
Vectors used:
  PGC:  walk vector (PHV); live vector and live instance vector (Bloom filters)
  PGC+: walk vector (PHV); live vector (PHV)
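The per-fingerprint costs quoted above translate into memory as follows. The 2.7 and 6 bits per fingerprint come from the slide; the fingerprint count of 10 billion and the helper `vector_bytes` are purely hypothetical, chosen only to make the arithmetic visible.

```python
def vector_bytes(num_fingerprints: int, bits_per_fingerprint: float) -> float:
    """Memory needed to track every fingerprint at a given per-entry cost."""
    return num_fingerprints * bits_per_fingerprint / 8

NUM_FINGERPRINTS = 10 * 10**9   # hypothetical system size, not from the talk

bloom_live_vector = vector_bytes(NUM_FINGERPRINTS, 6.0)   # Bloom filter
phv_live_vector = vector_bytes(NUM_FINGERPRINTS, 2.7)     # perfect hash vector
ratio = phv_live_vector / bloom_live_vector

print(f"Bloom filter live vector: {bloom_live_vector / 1e9:.3f} GB")
print(f"PHV live vector:          {phv_live_vector / 1e9:.3f} GB")
print(f"PHV / Bloom:              {ratio:.2f}")
```

At these rates the perfect hash live vector takes 45% of the space of the Bloom filter live vector, independent of the fingerprint count, which is what makes it plausible to represent all chunks in memory and skip the pre-phases.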

16 Phase-optimized Physical GC (PGC+) Phases
1. Merge
2. Analysis
3. Enumeration
4. Select
5. Copy
6. Summary

17 PGC+ Analysis and Enumeration
- Replace the Bloom filter with a perfect hash vector for tracking live and dead chunks
- In the analysis phase, build two perfect hash vectors:
  - An Lp vector called the walk vector (similar to PGC)
  - A perfect hash vector over all fingerprints (Lp + L0) called the live vector
- Perfect hashing optimizations:
  - NUMA-aware perfect hashing
  - Cache prefetching of perfect hash functions and values in the perfect hash vector

18 PGC+ Copy Phase
- Dynamically remove duplicates during the copy phase.
[Figure: containers C1 (fp1, fp2) and C2 (fp1, fp3) with a live vector over fp1, fp2, fp3, shown in three steps. Initial state: the live vector bits for fp1, fp2, fp3 are 1, 1, 1. Then C2 is processed, then C1 is processed.]
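One way to read the three steps in the figure is sketched below. The rule that a chunk's live bit is cleared as soon as it has been copied forward is an assumption (the bit values after each step did not survive in the slide text), and `copy_phase` with a plain dict as the live vector is a hypothetical stand-in for the real structures.

```python
def copy_phase(selected_containers, live):
    """Copy live chunks forward, dropping later duplicates of the same chunk.

    selected_containers: list of (cid, [fingerprints]) in processing order.
    live: dict fingerprint -> bool, standing in for the live vector.
    """
    new_container = []
    for cid, fingerprints in selected_containers:
        for fp in fingerprints:
            if live.get(fp):
                new_container.append(fp)   # first live instance is copied
                live[fp] = False           # assumption: clear the bit on copy
            # else: dead chunk, or a duplicate already copied; skip it
    return new_container

live_vector = {"fp1": True, "fp2": True, "fp3": True}   # initial state: 1 1 1
copied = copy_phase(
    [("C2", ["fp1", "fp3"]), ("C1", ["fp1", "fp2"])],   # C2 first, then C1
    live_vector,
)
```

Under this reading, fp1 is copied once while C2 is processed and its second instance in C1 is skipped, so no separate live instance vector is needed to remove duplicates.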

19 Evaluation
Deployed systems:
- Comparison of GC runs for systems upgraded from LGC to PGC
Controlled experiments on 4 systems:
- Comparison of LGC vs. PGC vs. PGC+
- One-phase versus two-phase GC
- DD860 used as the default for all experiments
- The workload was a synthetic dataset similar to some past deduplication work (e.g., Botelho et al., FAST 2012)

  Systems                  DD2500     DD860       DD890      DD990
  CPU (cores x GHz)        8 x 2.2    16 x 2.53   24 x 2.8   40 x 2.4
  Memory (GB)              64         70          94         256
  Physical capacity (TB)   122        126         167        319

20 Deployed System Results: LGC vs. PGC
- For high-TC workloads, PGC improved on LGC by up to 20x
- For the high-file-count workload, PGC improved on LGC by 7x
- 75% of systems upgraded from LGC to PGC suffered some degradation, but usually not much
- It is hard to compare LGC and PGC on deployed systems because other performance changes were introduced along with PGC
- Lab experiments were used to compare all GC variants with the same performance parameters

21 GC on Different Platforms (36.6x TC)
- At this level of deduplication, LGC2 is slightly better than PGC2, but PGC+ is better than both LGC2 and PGC2.

22 High Total Compression Workload
[Figure: bar chart of GC duration (hours) against total compression factor (TC), with bars for LGC (LGC1, LGC2), PGC (PGC1, PGC2) and PGC+ at each TC value; legible axis labels are 73.2x, 147x, 293x, 586x, 1170x and 2340x.]
- LGC duration scales with TC
- PGC and PGC+ remain flat

23 High File Count Workload
[Figure: bar chart of GC duration (hours) for LGC (LGC1, LGC2), PGC (PGC1, PGC2) and PGC+ on a high-file-count workload (900M files); one bar carries the label 187.]
- LGC1/LGC2 is orders of magnitude slower than PGC

24 Conclusions
- The shift in workloads required moving from a depth-first mark phase to a breadth-first mark phase
- PGC works better than LGC for very high TC datasets and large numbers of small files
- Due to the extra phases and performance constraints introduced in PGC, PGC is not uniformly faster than LGC
- PGC+ uses various optimizations to improve over PGC, primarily by avoiding multiple mark phases
- PGC+ is significantly faster than LGC when 2 mark phases are required, and orders of magnitude faster for problematic workloads
