Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information


Min Fu, Dan Feng, Yu Hua, Xubin He, Zuoning Chen*, Wen Xia, Fangting Huang, Qing Liu
Huazhong University of Science and Technology; Virginia Commonwealth University; *National Engineering Center for Parallel Computer
June 19, 2014


Background
What is data deduplication?
Data deduplication is a scalable compression technique used in large-scale backup systems. Traditional compression compresses a piece of data (e.g., a small file) at byte granularity; data deduplication compresses the entire storage system at chunk granularity.
The fragmentation problem caused by data deduplication:
1. Slow restore (up to a 21X decrease!): the data we need is dispersed physically.
2. Slow and cumbersome garbage collection: the data we do NOT need is dispersed physically.

How the fragmentation arises
The first backup: we have 13 chunks, most of which are UNIQUE. Restoring this backup with a 3-container-sized LRU cache requires 5 container reads; with an unlimited cache, it requires 4 container reads.

The second backup: we again have 13 chunks, 9 of which are DUPLICATE. Restoring this backup with a 3-container-sized LRU cache requires 9 container reads; with an unlimited cache, it requires 6 container reads.
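The container-read counts above can be reproduced with a toy restore simulator. The sketch below is illustrative (the container sequence is hypothetical, not the slides' actual chunk layout); it counts how many container reads an LRU cache of a given size incurs.

```python
from collections import OrderedDict

def lru_container_reads(container_seq, cache_size):
    """Count container reads for a restore, given the per-chunk sequence
    of home containers and an LRU cache holding `cache_size` containers."""
    cache = OrderedDict()            # container id -> None, most recent last
    reads = 0
    for c in container_seq:
        if c in cache:
            cache.move_to_end(c)     # hit: refresh recency
        else:
            reads += 1               # miss: one container read from disk
            cache[c] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict least recently used
    return reads

# A hypothetical fragmented sequence: duplicate chunks pull in old
# containers between new ones, so a small cache re-reads them.
seq = [1, 2, 3, 1, 4, 2]
assert lru_container_reads(seq, 3) == 5     # small cache: 5 reads
assert lru_container_reads(seq, 100) == 4   # unlimited cache: one read per distinct container
```

With a large enough cache, the read count degenerates to the number of distinct containers referenced, which is why fragmentation hurts most when the cache is small.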

If we delete the first backup: 4 chunks (B, F, H, and K) become invalid, but we canNOT reclaim their space without additional mechanisms.
Container merge operation: migrate the valid chunks into new containers. This is the most time-consuming phase in garbage collection.

Our observations and motivations
The fragmentation taxonomy:
Sparse container: a container whose utilization is smaller than a utilization threshold (e.g., 50%), such as Container IV for backup 2.
Out-of-order container: a container whose chunks are intermittently referenced by a backup, such as Container V for backup 2.
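The taxonomy can be stated operationally. The following sketch (with illustrative names and data) classifies the containers referenced by one backup: a container is sparse when the backup references less than a threshold fraction of its chunks, and out-of-order when its references form more than one contiguous run in the backup stream.

```python
from collections import Counter

def classify_containers(stream, container_chunks, threshold=0.5):
    """stream: the backup's per-chunk home-container sequence.
    container_chunks: total number of chunks stored in each container.
    Returns (sparse, out_of_order) container-id sets for this backup."""
    refs = Counter(stream)
    sparse = {c for c, n in refs.items()
              if n / container_chunks[c] < threshold}
    runs, prev = Counter(), None
    for c in stream:
        if c != prev:
            runs[c] += 1             # a new contiguous run of references
        prev = c
    out_of_order = {c for c, r in runs.items() if r > 1}
    return sparse, out_of_order

# Hypothetical backup: containers I, IV, V, II each hold 4 chunks.
sparse, ooo = classify_containers(
    ['I', 'I', 'I', 'IV', 'V', 'II', 'V'],
    {'I': 4, 'IV': 4, 'V': 4, 'II': 4})
assert sparse == {'IV', 'II'}   # referenced at only 25% utilization
assert ooo == {'V'}             # referenced in two separate runs
```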

The negative impacts and potential solutions:
Sparse containers directly amplify read operations, hence hurting both restore and garbage collection. Solution: rewrite the referenced chunks in them to new compact containers, i.e., a rewriting algorithm.
Out-of-order containers hurt restore if the restore cache is small. Solution: increase the cache size, or develop a more intelligent replacement algorithm than LRU.

How do existing rewriting algorithms work?
Deduplication is delayed in order to identify fragmented duplicate chunks: the algorithms fill a rewriting buffer and identify duplicate but fragmented chunks within it. Chunk M is assumed to be in a sparse container, since it has no physical neighbor in the buffer.

Existing rewriting algorithms: if we extend the rewriting buffer, more physical neighbors of M would be found. M is actually in an out-of-order container rather than a sparse container! Extending the buffer is NOT scalable, since memory is limited.
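The buffer-size dependence can be demonstrated directly. The sketch below (all names and data illustrative) encodes the buffer-limited test the slides describe: a chunk is judged fragmented iff no other chunk in the rewriting buffer lives in the same container, so the verdict on M flips as the buffer grows.

```python
def looks_sparse_in_buffer(buffer_chunks, home_container, m):
    """Buffer-limited fragmentation test used by prior rewriting
    algorithms (sketch): chunk `m`'s home container is deemed sparse
    iff no other buffered chunk shares that container."""
    home = home_container[m]
    return not any(home_container[c] == home
                   for c in buffer_chunks if c != m)

home = {'A': 1, 'B': 1, 'M': 7, 'N': 7}
assert looks_sparse_in_buffer(['A', 'B', 'M'], home, 'M') is True        # small buffer: M looks sparse
assert looks_sparse_in_buffer(['A', 'B', 'M', 'N'], home, 'M') is False  # larger buffer: neighbor N found
```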


The problems of existing rewriting algorithms: they CANNOT accurately differentiate sparse containers from out-of-order containers, due to the limited size of the rewriting buffer. As a result, they lose too much storage efficiency and gain limited restore speed.
The challenge: due to the existence of out-of-order containers, accurately identifying sparse containers requires complete knowledge of the on-going backup. How can we obtain such knowledge on the fly?

Rationale: due to the incremental nature of backup, consecutive backups share similar characteristics, including fragmentation.
Our key observations:
1. The total number of sparse containers continuously grows.
2. The total number of sparse containers increases smoothly: there are only a limited number of emerging sparse containers in each backup.
3. A backup inherits most of the sparse containers of the last backup.

Design and implementation
Figure: The system architecture. Colored modules are our contributions.
Three contributions:
History-Aware Rewriting algorithm (HAR) in backup, tackling sparse containers;
Belady's optimal replacement algorithm (OPT) in restore, tackling out-of-order containers;
Container-Marker Algorithm (CMA) in garbage collection, a new reference management scheme.


History-Aware Rewriting (HAR)
HAR records sparse containers during a backup and rewrites the referenced chunks in them during the next backup. The emerging sparse containers of a backup become the inherited sparse containers of the next backup.
Advantages:
NOT rewriting emerging sparse containers does NOT hurt restore performance, due to the limited number of emerging sparse containers (observation 2);
rewriting inherited sparse containers does NOT hurt backup performance, due to the limited number of inherited sparse containers (observation 2);
sparse containers are identified accurately, due to observation 3.
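One HAR backup pass can be sketched as follows (a minimal sketch with illustrative names, not the paper's implementation): a duplicate chunk is deduplicated as usual unless its home container is in the inherited sparse set, in which case it is rewritten; unique chunks are written normally.

```python
def har_backup_pass(stream, known_chunks, home_container, inherited_sparse):
    """One HAR backup pass (sketch). Returns (written, deduplicated):
    `written` holds unique chunks plus duplicates rewritten out of
    inherited sparse containers; `deduplicated` holds duplicates
    stored as references only."""
    written, deduped = [], []
    for chunk in stream:
        if chunk in known_chunks and home_container[chunk] not in inherited_sparse:
            deduped.append(chunk)    # normal dedup: store a reference only
        else:
            written.append(chunk)    # unique, or rewritten from a sparse container
    return written, deduped

# Hypothetical state: A and C are known duplicates; A lives in sparse
# container 1 (inherited from the last backup), C in healthy container 2.
written, deduped = har_backup_pass(
    ['A', 'C', 'E'], {'A', 'C'}, {'A': 1, 'C': 2}, {1})
assert written == ['A', 'E']   # A is rewritten, E is new
assert deduped == ['C']
```

After the pass, the sparse containers observed during this backup would be recorded as the inherited set for the next backup, which is how the history propagates.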


Optimal cache
The problem of out-of-order containers: they hurt restore performance if the restore cache is small.
Our observation: we restore a backup stream according to the fingerprint sequence preserved in the recipe. As a result, we know the exact future access pattern of containers during the restore, so cache replacement algorithms more intelligent than LRU are possible.
Belady's optimal replacement algorithm: when the restore cache is full, the container that will not be accessed for the longest time in the future is evicted.
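Because the recipe fixes the whole future container access sequence, OPT is directly implementable for restore. A minimal sketch (hypothetical sequence, illustrative names): on a miss with a full cache, evict the cached container whose next use lies farthest in the future, or never comes.

```python
import bisect
from collections import defaultdict

def opt_container_reads(container_seq, cache_size):
    """Count container reads under Belady's OPT, given the full
    per-chunk home-container sequence from the recipe."""
    positions = defaultdict(list)      # container -> sorted access indices
    for i, c in enumerate(container_seq):
        positions[c].append(i)

    def next_use(c, now):
        j = bisect.bisect_right(positions[c], now)
        return positions[c][j] if j < len(positions[c]) else float('inf')

    cache, reads = set(), 0
    for i, c in enumerate(container_seq):
        if c in cache:
            continue
        reads += 1                     # miss: one container read
        if len(cache) == cache_size:
            # evict the container used farthest in the future (or never)
            cache.remove(max(cache, key=lambda x: next_use(x, i)))
        cache.add(c)
    return reads

seq = [1, 2, 3, 1, 4, 2]               # hypothetical recipe order
assert opt_container_reads(seq, 3) == 4   # an LRU cache of the same size needs 5
```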


Container-Marker Algorithm
Two-phased garbage collection:
1. Chunk reference management: find invalid chunks and calculate container utilizations to identify which containers are worth merging (i.e., sparse containers). Its overhead is proportional to the number of chunks.
2. Container merge: migrate the valid chunks in sparse containers to new containers. This phase competes with regular backups and urgent restores for I/O bandwidth.
Our observation: after the latest backup referring to a sparse container is deleted, we can directly reclaim the container rather than merging it. Simplified reference management is possible!

Container-Marker Algorithm (CMA)
CMA maintains a container manifest for each dataset. The manifest records the IDs of all containers related to the dataset, and each container ID is paired with a backup time indicating the most recent backup that refers to the container. Suppose we delete all backups before time T: all containers with a backup time smaller than T can be reclaimed. The overhead is proportional to the number of containers rather than the number of chunks.
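The manifest logic above reduces to two small operations, sketched here with illustrative names: stamp every container a backup references with the backup's time, then free any container whose stamp is older than the deletion cutoff T.

```python
def mark_backup(manifest, backup_time, containers_referenced):
    """After each backup, stamp the referenced containers with the
    backup's time. Cost is O(#containers), not O(#chunks)."""
    for cid in containers_referenced:
        manifest[cid] = backup_time

def reclaimable(manifest, t):
    """Delete all backups before time `t`: a container whose most
    recent referencing backup is older than `t` can be freed."""
    return {cid for cid, last in manifest.items() if last < t}

manifest = {}
mark_backup(manifest, 1, {10, 11})       # backup 1 references containers 10, 11
mark_backup(manifest, 2, {11, 12})       # backup 2 references containers 11, 12
assert reclaimable(manifest, 2) == {10}  # deleting backup 1 frees container 10 only
```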

Evaluation
Methodology
We implement the baseline (no rewriting) and two existing rewriting algorithms (CBR @ SYSTOR'12 and CAP @ FAST'13) for comparison.
Deduplication ratio: the size of the non-deduplicated data divided by that of the deduplicated data.
Speed factor (@ FAST'13): a metric for restore performance, defined as the size of restored data (MB) per container read.
The number of valid containers: the actual storage cost after GC.
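The two ratio metrics are straightforward to compute; the snippet below pins down their direction (higher is better for both) using illustrative numbers, not the paper's measurements.

```python
def deduplication_ratio(logical_size, stored_size):
    """Non-deduplicated size divided by deduplicated (stored) size."""
    return logical_size / stored_size

def speed_factor(restored_mb, container_reads):
    """Restored data (MB) per container read; higher means faster restore."""
    return restored_mb / container_reads

assert deduplication_ratio(100 * 1024, 4 * 1024) == 25.0
assert speed_factor(400, 100) == 4.0
```

Rewriting trades the first metric for the second: rewritten duplicates lower the deduplication ratio but raise the speed factor.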

Table: Characteristics of datasets.

dataset name          VMDK      Linux     Synthetic
total size            1.44TB    104GB     4.5TB
# of versions         102       258       400
deduplication ratio   25.44     45.24     37.26
avg. chunk size       10.33KB   5.29KB    12.44KB
sparse                medium    severe    severe
out-of-order          severe    medium    medium

Table: Default settings.

fingerprint index       in-memory
container size          4MB
utilization threshold   50%
caching scheme          OPT
backup retention time   20 days
container merge         N/A

Figure: The comparison between HAR and the other rewriting algorithms (baseline, CBR, CAP) in terms of deduplication ratio, on VMDK, Linux, and Synthetic.
Conclusion (1): HAR rewrites less data than CBR and CAP.

Figure: The comparison of rewriting algorithms (baseline with LRU, baseline with OPT, CBR, CAP, HAR) in terms of restore performance, shown as speed factor vs. version number for (a) VMDK and (b) Linux. The cache is 512- and 32-container-sized in VMDK and Linux, respectively.
Conclusion (2): HAR achieves better restore performance while rewriting less data than CBR and CAP.

Figure: The comparison of rewriting algorithms under various cache sizes, shown as speed factor vs. cache size (in # of containers) for (a) VMDK and (b) Linux. Speed factor is the average over the last 20 backups.
Conclusion (3): HAR works significantly better when the restore cache is large.

Figure: The comparison of rewriting algorithms in terms of the storage cost after garbage collection, shown as # of valid containers vs. version number for (a) VMDK and (b) Linux.
Conclusion (4): After GC, HAR has the lowest storage cost since it reduces sparse containers.

More in the paper
A hybrid rewriting scheme: HAR+CBR and HAR+CAP.
More experimental results: the Synthetic dataset; metadata overhead of garbage collection; varying the utilization threshold in HAR.
Related work.

Summary
The fragmentation taxonomy: sparse containers hurt both restore and garbage collection; out-of-order containers hurt restore if the cache is small.
History-Aware Rewriting: rewrites less data but gains more restore speed than existing work; solves the sparse container problem.
Optimal cache: reduces the cache size we require; alleviates the out-of-order container problem.
Container-Marker Algorithm: a new reference management scheme that is simpler and has lower metadata overhead.