The Logic of Physical Garbage Collection in Deduplicating Storage

Fred Douglis, Abhinav Duggal, Philip Shilane, Tony Wong (Dell EMC)
Shiqin Yan (University of Chicago)
Fabiano Botelho (Rubrik)

2 Deduplication in Data Domain Filesystem (DDFS)
[Figure: File 1 (chunks R S T W) and File 2 (chunks W X Y Z) are split into variable-sized chunks, and a fingerprint (fp) is generated for each chunk.]
Fingerprint index (fp -> CID): R -> C1, S -> C1, T -> C2, W -> C2, X -> C3, Y -> C3, Z -> C4
Containers holding chunks: C1 = {R, S}, C2 = {T, W}, C3 = {X, Y}, C4 = {Z}
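
The write path on slide 2 can be sketched in a few lines of Python. This is a minimal illustration, not DDFS code: the `DedupStore` class, the two-chunk `CONTAINER_CAPACITY`, and the use of SHA-1 as the fingerprint function are all assumptions made for the example, and chunks arrive already split (variable-sized chunking is not modeled).

```python
import hashlib

CONTAINER_CAPACITY = 2  # chunks per container; tiny, to mirror the slide's figure


def fingerprint(chunk: bytes) -> str:
    """Return a content fingerprint for a chunk (SHA-1 is an assumption here)."""
    return hashlib.sha1(chunk).hexdigest()


class DedupStore:
    """Hypothetical store: a fingerprint index plus append-only containers."""

    def __init__(self):
        self.index = {}          # fingerprint -> container ID (CID)
        self.containers = [[]]   # the CID is the position in this list

    def write_file(self, chunks):
        """Store a file's chunks and return its recipe (list of fingerprints)."""
        recipe = []
        for chunk in chunks:
            fp = fingerprint(chunk)
            if fp not in self.index:  # new chunk: append it to the open container
                if len(self.containers[-1]) == CONTAINER_CAPACITY:
                    self.containers.append([])
                self.containers[-1].append(chunk)
                self.index[fp] = len(self.containers) - 1
            recipe.append(fp)         # duplicates only add a reference
        return recipe


store = DedupStore()
file1 = store.write_file([b"R", b"S", b"T", b"W"])
file2 = store.write_file([b"W", b"X", b"Y", b"Z"])
print(len(store.index), "unique chunks in", len(store.containers), "containers")
```

Chunk W is written once even though both files reference it, so eight logical chunks occupy seven index entries and four containers.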

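3 File Representation in DDFS
- Files are represented as a Merkle tree of fingerprints
- Lp chunks (L1 through L6) hold metadata; L0 chunks hold data and are stored on disk in containers
- COPY: fastcopy creates a new root into the same tree
[Figure: a tree with L6 at the root, then L5, L4, L3, and L2 chunks, down to L1 chunks that list fingerprints (for example, L1: Rfp Sfp Tfp Ufp Vfp Wfp Xfp Yfp and L1: Rfp Sfp Zfp), which in turn reference L0 chunks.]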
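
A toy version of the tree on slide 3 shows why files with common content share most of their chunks. Everything here is a hypothetical sketch: the `put` and `build_file` helpers, the fanout of 2, and SHA-1 are assumptions, and fastcopy is simplified to a second namespace entry that points at the same root instead of a newly written root chunk.

```python
import hashlib

chunks = {}  # fingerprint -> (level, payload); payload is bytes for L0, child fps for Lp


def put(level, payload):
    """Store one chunk (content-addressed) and return its fingerprint."""
    data = payload if level == 0 else "".join(payload).encode()
    fp = hashlib.sha1(bytes([level]) + data).hexdigest()
    chunks[fp] = (level, payload)
    return fp


def build_file(data_chunks, fanout=2):
    """Build the tree bottom-up and return the root fingerprint."""
    level, fps = 0, [put(0, c) for c in data_chunks]
    while len(fps) > 1 or level == 0:
        level += 1
        fps = [put(level, fps[i:i + fanout]) for i in range(0, len(fps), fanout)]
    return fps[0]


namespace = {}
namespace["File 1"] = build_file([b"R", b"S", b"T", b"U"])
namespace["File 1 copy"] = namespace["File 1"]  # fastcopy, simplified: nothing is copied
stored_before = len(chunks)
namespace["File 2"] = build_file([b"R", b"S", b"Z", b"Y"])
print(stored_before, "chunks after File 1;", len(chunks), "after File 2")
```

File 2 reuses the L0 chunks R and S and the L1 chunk that references them, so it adds only four new chunks (two L0, one L1, and a root).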

4 Deduplication Workloads on Data Domain
Traditional backups
- Weekly full and daily incremental backups
- Full backups tend to be very large: 100 GBs to TBs
- Much content in full backups repeats the previous full backup
- Typically 10-20x total compression (TC); for example, 20x TC = 10x dedup and 2x compression
New workloads
- Synthetic full backups: send changes and a recipe to create a single full backup from some previous backup
- Daily fulls
- High TC (100x-400x or higher)
- High file count: 100M to 1 billion small files

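5 Garbage Collection in a Deduplication Filesystem
[Figure: fingerprint index and containers for File 1, File 2, and File 3, highlighting a duplicate chunk and a shared chunk.]
Fingerprint index (fp -> CID): R -> C1, S -> C1, T -> C2, W -> C2, X -> C3, Y -> C3, Z -> C4, Q -> C5, Y -> C5
Containers holding chunks: C1 = {R, S}, C2 = {T, W}, C3 = {X, Y}, C4 = {Z}, C5 = {Q, Y}
Duplicates are sometimes written to improve throughput.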

6 Evolution of GC in DDFS
Logical GC (LGC)
- Depth-first traversal of the per-file Merkle tree on disk to mark live chunks in memory
- In-memory data structures may not allow the system to track all chunks, so an extra mark phase ("pre-phases") is used when necessary
Physical GC (PGC)
- Breadth-first traversal of the physical layout of Merkle trees to mark live chunks in memory
- As with LGC, pre-phases may be needed
Phase-optimized Physical GC (PGC+)
- Improvement over PGC by removing pre-phases, plus other optimizations

7 Logical GC Phases
- Merge: merge the in-memory index into the index on disk
- Enumeration: depth-first walk that marks live chunks in an in-memory Bloom filter called the live vector
- Filter: create the live instance vector (also a Bloom filter) from the live vector to remove duplicates
- Select: select the best containers to compact
- Copy: copy live chunks from the selected containers into new containers and delete the old containers
The slide groups these phases into a mark phase and a sweep phase.
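
The enumeration, select, and copy phases of logical GC can be sketched as a small mark-and-sweep over toy data. This is an illustration under stated assumptions, not the DDFS implementation: the `BloomFilter` class, the `children` map standing in for on-disk Lp chunks, the 50% liveness `threshold`, and the container layout are all hypothetical, and the filter phase (live instance vector) is not modeled.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def enumerate_live(roots, children):
    """Mark phase: per-file depth-first walk; returns the live vector and visit count."""
    live, visits = BloomFilter(), 0
    for root in roots:
        stack = [root]
        while stack:
            fp = stack.pop()
            visits += 1
            live.add(fp)
            stack.extend(children.get(fp, []))  # L0 fingerprints have no children
    return live, visits


def sweep(containers, live, threshold=0.5):
    """Select containers whose live fraction is at or below threshold; copy live chunks forward."""
    kept, copied = {}, []
    for cid, fps in containers.items():
        live_fps = [fp for fp in fps if fp in live]
        if len(live_fps) / len(fps) > threshold:
            kept[cid] = fps
        else:
            copied.extend(live_fps)
    if copied:
        kept[max(containers) + 1] = copied  # new container for the copied chunks
    return kept


children = {"F1": ["L1a", "L1b"], "F2": ["L1a", "L1c"],
            "L1a": ["R", "S"], "L1b": ["T", "W"], "L1c": ["X", "Y"]}
containers = {1: ["R", "S"], 2: ["T", "W"], 3: ["X", "Y"], 4: ["Z", "Q"]}
live, visits = enumerate_live(["F1", "F2"], children)
kept = sweep(containers, live)
print(visits, sorted(kept))
```

The shared subtree under L1a is walked once per file, which is why the visit count grows with the number of files and the amount of sharing. Barring a Bloom-filter false positive, container 4 holds only dead chunks and is dropped.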

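8 Enumeration Phase (Logical GC)
[Figure: file Merkle trees walked depth-first from each L6 root through L2 and L1 chunks, with one L2 chunk shared between trees, above their L0 chunks.]
Only Lp chunks are traversed.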

9 Logical GC → Physical GC
Logical enumeration performance is sensitive to the following parameters:
- Total compression factor
- Number of small files
- Spatial locality of Lp chunks
Physical GC addresses these performance issues.

10 Physical GC (PGC)
- Uses a breadth-first walk instead of a per-file depth-first walk during enumeration
- Uses a Perfect Hash Vector (PHV) to store Lp fingerprints for assisting the breadth-first walk
  - Uses less memory
  - Needed for doing checksums to prevent corruption
- New analysis phase to build perfect hash functions for Lp fingerprints
- Remaining phases are the same as in logical GC
Vectors used:
        Walk vector   Live vector    Live instance vector
LGC     (none)        Bloom filter   Bloom filter
PGC     PHV           Bloom filter   Bloom filter
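
The breadth-first walk can be sketched as one pass over the containers per tree level, from the top level down to L1. This is a simplified model with several assumptions: each container holds Lp chunks of a single level, plain Python sets stand in for the walk vector (a PHV on the slides) and the live vector, and the `physical_enumerate` function and its sample data are hypothetical.

```python
def physical_enumerate(roots, containers, max_level=6):
    """containers: list of (level, {fp: [child fps]}) in on-disk order.

    Returns the walk set (live Lp fingerprints), the live set (L0 fingerprints),
    and the number of Lp chunks read.
    """
    walk = set(roots)  # stand-in for the walk vector
    live = set()       # stand-in for the live vector
    reads = 0
    for level in range(max_level, 0, -1):  # all L6, then all L5, down to L1
        for container_level, lp_chunks in containers:
            if container_level != level:
                continue
            for fp, child_fps in lp_chunks.items():
                if fp not in walk:
                    continue  # no live parent references this Lp chunk
                reads += 1
                target = live if level == 1 else walk
                target.update(child_fps)
    return walk, live, reads


containers = [
    (2, {"F1": ["L1a", "L1b"], "F2": ["L1a", "L1c"], "F0": ["L1d"]}),
    (1, {"L1a": ["R", "S"], "L1b": ["T", "W"], "L1c": ["X", "Y"], "L1d": ["Z"]}),
]
walk, live, reads = physical_enumerate(["F1", "F2"], containers, max_level=2)
print(sorted(live), reads)
```

Each live Lp chunk is read exactly once regardless of how many files share it: the shared chunk L1a costs one read here, and the deleted file F0 and its subtree are never marked.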

11 Collision-Free Perfect Hashing Vector (PHvec)
[Figure: a perfect hash function (PHF) maps the fingerprint set S = {s1, s2, ..., sn} (positions 0 to n-1) to positions 0 to m-1 of a bit vector, with m ≥ n.]
A PHF is a collision-free hash function that maps each fingerprint to a unique position in a bit vector.
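
To make the idea concrete, here is a deliberately naive perfect hash vector. It is a sketch under assumptions, not the construction used in DDFS: it finds a collision-free mapping by brute-force seed search, which only works for very small sets, and the names `slot`, `build_phf`, and `PerfectHashVector` are invented for the example.

```python
import hashlib
from itertools import count


def slot(seed, fp, m):
    """Hash a fingerprint to a position in 0..m-1 under the given seed."""
    digest = hashlib.sha1(f"{seed}:{fp}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


def build_phf(fingerprints, m):
    """Try seeds until every fingerprint lands in a distinct slot (toy-sized sets only)."""
    keys = set(fingerprints)
    if len(keys) > m:
        raise ValueError("need m >= n slots for a collision-free mapping")
    for seed in count():
        if len({slot(seed, fp, m) for fp in keys}) == len(keys):
            return seed


class PerfectHashVector:
    """One bit per slot; fingerprints in the build set never collide."""

    def __init__(self, fingerprints, m):
        self.m = m
        self.seed = build_phf(fingerprints, m)
        self.bits = bytearray(m)  # one byte per bit, for clarity

    def set(self, fp):
        self.bits[slot(self.seed, fp, self.m)] = 1

    def test(self, fp):
        return bool(self.bits[slot(self.seed, fp, self.m)])


fps = ["fp1", "fp2", "fp3", "fp4", "fp5"]
phv = PerfectHashVector(fps, m=8)
phv.set("fp1")
print([phv.test(fp) for fp in fps])
```

Because the mapping is collision-free over the build set, marking fp1 cannot mark any other known fingerprint, which is the property a Bloom filter lacks. A fingerprint outside the build set still hashes to some slot, so the set of fingerprints has to be known before the function is built.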

12 Analysis Phase
On-disk container index:
FP     CID   Type
fp1    10    L0
fp2    5     Lp
fp3    30    Lp
...
fpn    40
[Figure: the index is scanned to build in-memory perfect hash functions of Lp, sized by the number of fingerprints (#fps).]

13 Benefits & Costs of Physical Enumeration
Pro:
- Sequential scan of containers on disk: all L6s, then all L5s, down to L1s
- Relatively few containers store high-level metadata
- No need to keep revisiting the same Lp containers due to fastcopy (high deduplication)
Con:
- The extra analysis cost doesn't help traditional workloads, and due to pre-phases we may have to run analysis twice

14 LGC and PGC Phases (Including Pre-phases)
Logical GC
1. Pre-merge
2. Pre-enumeration
3. Pre-filter
4. Pre-select
5. Candidate
6. Enumeration
7. Merge
8. Filter
9. Copy
10. Summary
Physical GC
1. Pre-merge
2. Pre-analysis
3. Pre-enumeration
4. Pre-filter
5. Pre-select
6. Merge
7. Analysis
8. Candidate
9. Enumeration
10. Filter
11. Copy
12. Summary
In both lists, the steps named "Pre-" are the pre-phases / sampling phases.

15 Physical GC → Phase-optimized Physical GC
Limitations of Physical GC
- Adds 2 extra phases (pre-analysis and analysis)
- Slightly degrades GC performance for customers with traditional backup workloads
Motivation for Phase-optimized Physical GC (PGC+)
- Avoid pre-phases by representing all chunks in memory
- Can we use a perfect hash as the live vector? It needs only 2.7 bits per fingerprint instead of 6 bits in a Bloom filter
- Can we maintain the duplicate recipe without using a Bloom filter? Get 50% of memory back
Vectors used:
        Walk vector   Live vector    Live instance vector
PGC     PHV           Bloom filter   Bloom filter
PGC+    PHV           PHV            (not used)

16 Phase-optimized Physical GC (PGC+) Phases
1. Merge
2. Analysis
3. Enumeration
4. Select
5. Copy
6. Summary

17 PGC+ Analysis and Enumeration
- Replace the Bloom filter with a perfect hash vector for tracking live and dead chunks
- In the analysis phase, build two perfect hash vectors:
  - An Lp vector called the walk vector (similar to PGC)
  - A perfect hash vector over all fingerprints (Lp + L0), called the live vector
- Perfect hashing optimizations:
  - NUMA-aware perfect hashing
  - Cache prefetching of perfect hash functions and values in the perfect hash vector

18 PGC+ Copy Phase
Dynamically remove duplicates during the Copy phase.
[Figure: three panels (Initial state, Process C2, Process C1), each showing container C1 = {fp1, fp2}, container C2 = {fp1, fp3}, and the live vector entries for fp1, fp2, and fp3.]
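
One plausible reading of the three panels (initial state, process C2, process C1) is that the copy phase visits containers from newest to oldest and clears a chunk's live bit once that chunk has been copied, so an older duplicate is then treated as dead. The sketch below encodes that reading as an assumption; the `copy_phase` function is hypothetical, and a Python set stands in for the perfect hash live vector.

```python
def copy_phase(containers, live):
    """containers: {cid: [fps]}; live: fingerprints marked live by enumeration.

    Returns the (cid, fp) pairs that get copied forward, in processing order.
    """
    live = set(live)  # work on a copy; stand-in for the PHV live vector
    copied = []
    for cid in sorted(containers, reverse=True):  # assumed order: newest container first
        for fp in containers[cid]:
            if fp in live:
                copied.append((cid, fp))
                live.discard(fp)  # clear the bit: older copies of fp are now dead
    return copied


containers = {1: ["fp1", "fp2"], 2: ["fp1", "fp3"]}
copied = copy_phase(containers, live={"fp1", "fp2", "fp3"})
print(copied)
```

With this ordering, fp1 is copied from C2 and its duplicate in C1 is skipped, so no separate live instance vector is needed to pick a single instance of each chunk.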

19 Evaluation
Deployed systems
- Comparison of GC runs for systems upgraded from LGC to PGC
Controlled experiments on 4 systems
- Comparison of LGC vs. PGC vs. PGC+
- One-phase versus two-phase GC
- DD860 used as the default for all experiments
- Workload used was a synthetic dataset similar to some past deduplication work (e.g., Botelho et al., FAST 2012)
System                  DD2500     DD860      DD890      DD990
CPU (cores x GHz)       8 x 2.2    16 x 2.53  24 x 2.8   40 x 2.4
Memory (GB)             64         70         94         256
Physical capacity (TB)  122        126        167        319

20 Deployed System Results: LGC vs. PGC
- For high-TC workloads, PGC improved over LGC by up to 20x
- For high-file-count workloads, PGC improved over LGC by 7x
- 75% of systems upgraded from LGC to PGC suffered some degradation, but usually not much
- It is hard to compare LGC and PGC systems because other performance changes were introduced with PGC
- Lab experiments compare all GC variants with the same performance parameters

21 GC on Different Platforms (36.6x TC)
At this level of dedup, LGC2 is slightly better than PGC2, but PGC+ is better than both LGC2 and PGC2.

22 High Total Compression Workload
[Figure: GC duration (hours) versus total compression factor (TC) for LGC2, LGC1, PGC2, PGC1, and PGC+; legible x-axis values are 73.2x, 147x, 293x, 586x, 1170x, and 2340x.]
- LGC duration scales with TC
- PGC/PGC+ remain flat

23 High File Count Workload
[Figure: GC duration (hours) for LGC2, LGC1, PGC2, PGC1, and PGC+ on a high-file-count (900M) workload; one value label, 187, is legible.]
LGC1/LGC2 is orders of magnitude slower than PGC.

24 Conclusions
- A shift in workloads required moving from a depth-first mark phase to a breadth-first mark phase
- PGC works better than LGC for very high TC datasets and large numbers of small files
- Due to the extra phases and performance constraints introduced in PGC, PGC is not uniformly faster than LGC
- PGC+ uses various optimizations to improve over PGC, primarily by avoiding multiple mark phases
- PGC+ is significantly faster than LGC when 2 mark phases are required, and orders of magnitude faster for problematic workloads


More information

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path. Hashing B+-tree is perfect, but... Selection Queries to answer a selection query (ssn=) needs to traverse a full path. In practice, 3-4 block accesses (depending on the height of the tree, buffering) Any

More information

Opendedupe & Veritas NetBackup ARCHITECTURE OVERVIEW AND USE CASES

Opendedupe & Veritas NetBackup ARCHITECTURE OVERVIEW AND USE CASES Opendedupe & Veritas NetBackup ARCHITECTURE OVERVIEW AND USE CASES May, 2017 Contents Introduction... 2 Overview... 2 Architecture... 2 SDFS File System Service... 3 Data Writes... 3 Data Reads... 3 De-duplication

More information

White paper ETERNUS CS800 Data Deduplication Background

White paper ETERNUS CS800 Data Deduplication Background White paper ETERNUS CS800 - Data Deduplication Background This paper describes the process of Data Deduplication inside of ETERNUS CS800 in detail. The target group consists of presales, administrators,

More information

Erik Riedel Hewlett-Packard Labs

Erik Riedel Hewlett-Packard Labs Erik Riedel Hewlett-Packard Labs Greg Ganger, Christos Faloutsos, Dave Nagle Carnegie Mellon University Outline Motivation Freeblock Scheduling Scheduling Trade-Offs Performance Details Applications Related

More information

Compression and Decompression of Virtual Disk Using Deduplication

Compression and Decompression of Virtual Disk Using Deduplication Compression and Decompression of Virtual Disk Using Deduplication Bharati Ainapure 1, Siddhant Agarwal 2, Rukmi Patel 3, Ankita Shingvi 4, Abhishek Somani 5 1 Professor, Department of Computer Engineering,

More information

UniCredit Global Backup Infrastructure. Mirco Lissandrini, Team Leader GCC Open Storage

UniCredit Global Backup Infrastructure. Mirco Lissandrini, Team Leader GCC Open Storage UniCredit Global Backup Infrastructure Mirco Lissandrini, Team Leader GCC Open Storage email: Mirco.Lissandrini@unicredit.eu Milan, 28 May 2013 AGENDA UniCredit at a glance UniCredit Business Integrated

More information

Getting it Right: Testing Storage Arrays The Way They ll be Used

Getting it Right: Testing Storage Arrays The Way They ll be Used Getting it Right: Testing Storage Arrays The Way They ll be Used Peter Murray Virtual Instruments Flash Memory Summit 2017 Santa Clara, CA 1 The Journey: How Did we Get Here? Storage testing was black

More information

CSE 530A. B+ Trees. Washington University Fall 2013

CSE 530A. B+ Trees. Washington University Fall 2013 CSE 530A B+ Trees Washington University Fall 2013 B Trees A B tree is an ordered (non-binary) tree where the internal nodes can have a varying number of child nodes (within some range) B Trees When a key

More information

CS3600 SYSTEMS AND NETWORKS

CS3600 SYSTEMS AND NETWORKS CS3600 SYSTEMS AND NETWORKS NORTHEASTERN UNIVERSITY Lecture 11: File System Implementation Prof. Alan Mislove (amislove@ccs.neu.edu) File-System Structure File structure Logical storage unit Collection

More information

DELL EMC DATA DOMAIN WITH RMAN USING ENCRYPTION FOR ORACLE DATABASES

DELL EMC DATA DOMAIN WITH RMAN USING ENCRYPTION FOR ORACLE DATABASES DELL EMC DATA DOMAIN WITH RMAN USING ENCRYPTION FOR ORACLE DATABASES A Technical Review ABSTRACT With the threat of security breaches, customers are putting in place defenses from these security breaches.

More information

Operating Systems. File Systems. Thomas Ropars.

Operating Systems. File Systems. Thomas Ropars. 1 Operating Systems File Systems Thomas Ropars thomas.ropars@univ-grenoble-alpes.fr 2017 2 References The content of these lectures is inspired by: The lecture notes of Prof. David Mazières. Operating

More information

FGDEFRAG: A Fine-Grained Defragmentation Approach to Improve Restore Performance

FGDEFRAG: A Fine-Grained Defragmentation Approach to Improve Restore Performance FGDEFRAG: A Fine-Grained Defragmentation Approach to Improve Restore Performance Yujuan Tan, Jian Wen, Zhichao Yan, Hong Jiang, Witawas Srisa-an, Baiping Wang, Hao Luo Outline Background and Motivation

More information

Heckaton. SQL Server's Memory Optimized OLTP Engine

Heckaton. SQL Server's Memory Optimized OLTP Engine Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability

More information

Using Transparent Compression to Improve SSD-based I/O Caches

Using Transparent Compression to Improve SSD-based I/O Caches Using Transparent Compression to Improve SSD-based I/O Caches Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

HP StoreOnce: reinventing data deduplication

HP StoreOnce: reinventing data deduplication HP : reinventing data deduplication Reduce the impact of explosive data growth with HP StorageWorks D2D Backup Systems Technical white paper Table of contents Executive summary... 2 Introduction to data

More information

Extreme Storage Performance with exflash DIMM and AMPS

Extreme Storage Performance with exflash DIMM and AMPS Extreme Storage Performance with exflash DIMM and AMPS 214 by 6East Technologies, Inc. and Lenovo Corporation All trademarks or registered trademarks mentioned here are the property of their respective

More information

PageForge: A Near-Memory Content- Aware Page-Merging Architecture

PageForge: A Near-Memory Content- Aware Page-Merging Architecture PageForge: A Near-Memory Content- Aware Page-Merging Architecture Dimitrios Skarlatos, Nam Sung Kim, and Josep Torrellas University of Illinois at Urbana-Champaign MICRO-50 @ Boston Motivation: Server

More information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information Subject 10 Fall 2015 Google File System and BigTable and tiny bits of HDFS (Hadoop File System) and Chubby Not in textbook; additional information Disclaimer: These abbreviated notes DO NOT substitute

More information

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability Topics COS 318: Operating Systems File Performance and Reliability File buffer cache Disk failure and recovery tools Consistent updates Transactions and logging 2 File Buffer Cache for Performance What

More information

Field Update Expanded Deduplication Sizing Guidelines. Oct 2015

Field Update Expanded Deduplication Sizing Guidelines. Oct 2015 Field Update Expanded Deduplication Sizing Guidelines Oct 2015 As part of our regular service pack updates in version 10, we have been making incremental improvements to our media and storage management

More information

HashKV: Enabling Efficient Updates in KV Storage via Hashing

HashKV: Enabling Efficient Updates in KV Storage via Hashing HashKV: Enabling Efficient Updates in KV Storage via Hashing Helen H. W. Chan, Yongkun Li, Patrick P. C. Lee, Yinlong Xu The Chinese University of Hong Kong University of Science and Technology of China

More information

Rhinoback Online Backup. In-File Delta

Rhinoback Online Backup. In-File Delta December 2006 Table of Content 1 Introduction... 3 1.1 Differential Delta Mode... 3 1.2 Incremental Delta Mode... 3 2 Delta Generation... 4 3 Block Size Setting... 4 4 During Backup... 5 5 During Restore...

More information

LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data

LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data Xingbo Wu Yuehai Xu Song Jiang Zili Shao The Hong Kong Polytechnic University The Challenge on Today s Key-Value Store Trends on workloads

More information