Cumulus: Filesystem Backup to the Cloud

Cumulus: Filesystem Backup to the Cloud
7th USENIX Conference on File and Storage Technologies (FAST '09)
Michael Vrable, Stefan Savage, Geoffrey M. Voelker
University of California, San Diego
February 26, 2009

Introduction
- Cloud computing: an important emerging area, with a spectrum of implementations
- Thick cloud: purchase a complete integrated service from a provider
  - Potentially greater efficiencies
  - Easier to set up
- Thin cloud: customer builds application on more generic services
  - More choices among service providers
  - Easier to migrate between providers
  - Potentially lower costs
- Thin cloud offers some advantages, particularly for applications such as backup
- How well can we do with such a simple interface?

Cumulus: Background and Requirements

Network Backup: Functionality
- Implement backup over a network to provide easy off-site storage
- Store snapshots of file data at multiple points in time
- Allow recovery of selected files or entire snapshot

System Requirements
- Build on a thin cloud model: simple storage interface only
- Storage layer need only support put/get of blobs of data, list, delete
- Implies that application logic must be built into client
- Focus on cloud storage, but could be FTP server, friend's computer, P2P network, ...

Goals
- Minimize resource requirements (storage, network)
- Minimize ongoing monetary costs
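The storage requirements above (put/get of blobs, list, delete) can be illustrated with a minimal sketch. The class name `InMemoryBlobStore` and its method signatures are hypothetical and are not taken from the Cumulus code; a dictionary stands in for the remote provider.

```python
from typing import Dict, List


class InMemoryBlobStore:
    """Hypothetical stand-in for a thin-cloud store: put, get, list, delete only."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, name: str, data: bytes) -> None:
        # Store an opaque blob under a name; the store never interprets it.
        self._blobs[name] = bytes(data)

    def get(self, name: str) -> bytes:
        # Return the whole blob; there are no partial reads in this sketch.
        return self._blobs[name]

    def list(self) -> List[str]:
        # Enumerate the stored blob names.
        return sorted(self._blobs)

    def delete(self, name: str) -> None:
        # Remove a blob that is no longer needed.
        del self._blobs[name]
```

Because the store only moves opaque bytes, everything else (snapshot layout, sharing, cleaning) has to live in the client, which is the point the slide makes.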

Cumulus Backup Format
[Figure: snapshot roots, metadata, and data blocks for the Monday and Tuesday snapshots; legend: Monday, Tuesday, Shared. Metadata: photos/a, photos/b, mbox, paper, mbox', paper'. Data: photoa, photob, mbox1, paper1, mbox2, paper2.]
- Stores filesystem snapshots at multiple points in time
- Data blocks shared within, between snapshots
- Minimizes storage, upload bandwidth needed
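A rough sketch of how two snapshots can share data blocks follows. It is a simplification under stated assumptions: each file is treated as a single block, blocks are identified by a SHA-256 digest, and `store_snapshot` is a hypothetical helper name, none of which is confirmed by the slides.

```python
import hashlib
from typing import Dict


def store_snapshot(files: Dict[str, bytes], block_store: Dict[str, bytes]) -> Dict[str, str]:
    """Return snapshot metadata (path -> block id), adding only unseen blocks."""
    metadata: Dict[str, str] = {}
    for path, content in files.items():
        block_id = hashlib.sha256(content).hexdigest()  # assumed block identifier
        if block_id not in block_store:
            block_store[block_id] = content  # new data: must be stored and uploaded
        metadata[path] = block_id  # unchanged files point at existing blocks
    return metadata


blocks: Dict[str, bytes] = {}
monday = store_snapshot(
    {"photos/a": b"A", "photos/b": b"B", "mbox": b"mail v1", "paper": b"draft v1"},
    blocks,
)
tuesday = store_snapshot(
    {"photos/a": b"A", "photos/b": b"B", "mbox": b"mail v2", "paper": b"draft v2"},
    blocks,
)
```

In this example the two snapshots reference eight files in total but only six blocks are stored, mirroring the six data blocks in the figure.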

Aggregation: Minimizing Per-Block Costs
[Figure: the Monday and Tuesday snapshot roots, metadata, and data blocks from the previous slide, grouped into segments.]
- May have per-file in addition to per-byte costs
  - Protocol overhead: slower backups from more transactions
  - Per-file overhead at storage server
  - May be exposed as monetary cost by provider
- Cumulus reduces these costs by aggregating blocks into segments before storage
- Aggregation follows from our constraints, but may not be needed in other systems
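Aggregation can be sketched as a simple greedy packer that fills one segment up to a size limit before starting the next. This is only an illustration of the idea; the function name `pack_into_segments` and the greedy policy are assumptions, not the packing strategy documented for Cumulus.

```python
from typing import List


def pack_into_segments(blocks: List[bytes], max_segment_bytes: int) -> List[List[bytes]]:
    """Group blocks, in order, into segments of at most max_segment_bytes each.

    A block larger than the limit gets a segment of its own.
    """
    segments: List[List[bytes]] = []
    current: List[bytes] = []
    current_size = 0
    for block in blocks:
        # Close the current segment if this block would overflow it.
        if current and current_size + len(block) > max_segment_bytes:
            segments.append(current)
            current, current_size = [], 0
        current.append(block)
        current_size += len(block)
    if current:
        segments.append(current)
    return segments


example_blocks = [b"x" * 400, b"y" * 400, b"z" * 400, b"w" * 100]
example_segments = pack_into_segments(example_blocks, max_segment_bytes=1000)
```

Here four blocks become two uploads instead of four, which is where the savings in per-file and per-transaction costs come from.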

Aggregation Challenges: Internal Fragmentation
[Figure: segment contents on Day 1, Day 2, Day 3, and Day 4 (new data and repacked data).]
- Wasted space within segments reclaimed by segment cleaning
- Tradeoff: space vs. upload bandwidth
- Contribution: show how to tune segment size, threshold for cleaning
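The cleaning decision can be sketched as a utilization check per segment. The sketch assumes the cleaning threshold is a utilization fraction, so that a segment is selected for repacking when its share of still-referenced bytes drops below the threshold; the function and parameter names are hypothetical.

```python
from typing import Dict, List


def segments_to_clean(
    live_bytes: Dict[str, int],
    total_bytes: Dict[str, int],
    threshold: float,
) -> List[str]:
    """Return names of segments whose utilization (live / total) is below threshold."""
    return sorted(
        name
        for name, total in total_bytes.items()
        if total > 0 and live_bytes.get(name, 0) / total < threshold
    )


totals = {"s1": 1000, "s2": 1000, "s3": 1000}
live = {"s1": 900, "s2": 300, "s3": 600}
selected = segments_to_clean(live, totals, threshold=0.5)
```

Under this reading, a threshold of 0 never cleans, while a threshold of 1 repacks any segment containing dead data, which trades more upload bandwidth for less wasted space.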

Cumulus Implementation
- Implemented as 4000 lines of C++ and Python
- Execution packages new data into segments, uploads to storage server
- Client tracks some data locally (not essential for restores):
  - Block hash database
  - Previous snapshot metadata (detect changed files)
- Other features:
  - Compression/encryption
  - Sub-file incremental updates
  - More details in the paper
- In real use: I have been using it for over 18 months
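One way the locally cached previous-snapshot metadata could be used to detect changed files is sketched below. Comparing size and modification time is an assumption made for illustration; the slides do not say which attributes Cumulus compares, and `changed_paths` is a hypothetical name.

```python
from typing import Dict, List, Tuple

# Assumed per-file record: (size in bytes, modification time).
FileStat = Tuple[int, float]


def changed_paths(previous: Dict[str, FileStat], current: Dict[str, FileStat]) -> List[str]:
    """Return paths that are new or whose recorded stat differs from the last snapshot."""
    return sorted(path for path, stat in current.items() if previous.get(path) != stat)


last_snapshot = {"mbox": (1200, 100.0), "paper": (5000, 100.0), "photos/a": (90000, 50.0)}
this_scan = {"mbox": (1350, 200.0), "paper": (5000, 100.0), "photos/a": (90000, 50.0), "notes": (10, 210.0)}
to_back_up = changed_paths(last_snapshot, this_scan)
```

Only the files in `to_back_up` would need to be read and hashed; because this cache is an optimization, losing it costs time on the next backup but does not affect restores.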

Evaluation
Key Questions:
- What is the resource (network, storage) overhead imposed by the restricted storage interface?
- How do these overheads translate into monetary terms?
- How can aggregation and cleaning be tuned to minimize the cost?
- How does the prototype perform?

Evaluation Traces

                        Fileserver    User
Duration (days)         157           223
Entries                 26673083      122007
Files                   24344167      116426
File Sizes
  Median                0.996 KB      4.4 KB
  Average               153 KB        21.4 KB
  Maximum               54.1 GB       169 MB
  Total                 3.47 TB       2.37 GB
Update Rates
  New data/day          9.50 GB       10.3 MB
  Changed data/day      805 MB        29.9 MB
  Total data/day        10.3 GB       40.2 MB

Backup Simulation
- Compare against optimal backup performance:
  - All unique data must be stored at server
  - All new data must be transferred over network
- In simulation, compare Cumulus against these baseline values
- Consider effect of aggregation, cleaning parameters
- For simplicity, ignore compression and metadata
  - Effects discussed in paper
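The comparison against the optimal baseline amounts to a relative overhead, which can be written as a small helper. The function name and the sample numbers are illustrative only and do not come from the traces.

```python
def overhead_percent(actual: float, optimal: float) -> float:
    """Overhead of a measured value relative to the optimal baseline, in percent."""
    if optimal <= 0:
        raise ValueError("optimal baseline must be positive")
    return 100.0 * (actual - optimal) / optimal


# Illustrative numbers: 44 MB/day uploaded when only 40 MB/day of new data exists.
example_overhead = overhead_percent(actual=44.0, optimal=40.0)
```

The same helper applies to both baselines: bytes stored versus unique data, and bytes uploaded versus new data.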

Is Cleaning Necessary?
[Figure: storage utilization (0.5 to 1) vs. time (0 to 200 days), with cleaning and with no cleaning.]
- Without segment cleaning, storage utilization steadily decreases
- Weekly cleaning keeps overhead within a narrow range
- Exact overhead depends on cleaning parameters

How Much Data is Transferred?
[Figure: overhead vs. optimal (%) and raw size (MB/day) vs. cleaning threshold (0 to 1), for 16 MB, 4 MB, 1 MB, 512 KB, and 128 KB segments.]
- Aggressive cleaning, large segments increase overhead

What is the Storage Overhead?
[Figure: overhead vs. optimal (%) and raw size (GB) vs. cleaning threshold (0 to 1), for 16 MB, 4 MB, 1 MB, 512 KB, and 128 KB segments.]
- Large segments increase overhead
- Too little cleaning leads to large overheads
- Aggressive cleaning leads to churn, storage overhead when keeping multiple snapshots

Estimating Ongoing Backup Costs
- How do storage, upload translate into total cost for implementing backup?
- Amazon S3 prices:
  - Storage: $0.15 per GB per month
  - Upload: $0.10 per GB
  - Operation: $0.01 per 1000 uploads
- Effects of varying costs discussed in the paper
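The three price components combine into a monthly total as sketched below. The prices are the ones listed on the slide; the function name and the sample usage figures are illustrative assumptions, not measurements from the traces.

```python
STORAGE_USD_PER_GB_MONTH = 0.15   # from the slide
UPLOAD_USD_PER_GB = 0.10          # from the slide
USD_PER_1000_UPLOADS = 0.01       # from the slide


def monthly_cost(stored_gb: float, uploaded_gb: float, upload_operations: int) -> float:
    """Total monthly cost in USD: storage + upload bandwidth + per-operation charges."""
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    upload = uploaded_gb * UPLOAD_USD_PER_GB
    operations = upload_operations / 1000 * USD_PER_1000_UPLOADS
    return storage + upload + operations


# Illustrative month: 3 GB stored, 1.2 GB uploaded in 1500 segment uploads.
example_cost = monthly_cost(stored_gb=3.0, uploaded_gb=1.2, upload_operations=1500)
```

The per-operation term is what penalizes very small segments: halving the segment size roughly doubles the number of uploads while leaving the other two terms nearly unchanged.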

What Settings Minimize Total Cost?
[Figure: cost increase vs. optimal (%) and cost ($/month) vs. cleaning threshold (0 to 1), for 16 MB, 4 MB, 1 MB, 512 KB, and 128 KB segments.]
- Aggressive cleaning, large segments increase overhead
- Total cost includes per-segment charge: intermediate segment size is best
- Cleaning threshold 0.4-0.6, segment size 0.5-1 MB work well

Simulation Summary
- Storage cost dominates (> 75% in this trace)
- Cost not overly sensitive to aggregation, cleaning settings
- Cost within 5-10% of best we could expect
- Implications for integrated backup?

Prototype Evaluations
- Tested full prototype using backups from two months of user trace
  - Snapshots stored properly, could be restored
  - Ongoing costs come out to $0.24/month for around 2 GB of data
- Compared with two existing tools for Amazon S3
  - Brackup and JungleDisk: two other tools capable of filesystem backup to S3
  - Monthly costs are 19-200% more
  - But, systems designed for more than just backup or not explicitly tuned for cost
- What about thick cloud?
  - Mozy: integrated online backup solution
  - $5/month for unlimited backups
  - $0.50/GB/month for businesses

Summary
- Cumulus is a cost-effective tool for backup to network storage
- We show how system parameters can be tuned to minimize total cost
- Shows specialized server not necessary for implementing low-overhead backup
- Can choose from variety of storage providers based on cost or other factors

Questions?
Cumulus is available at http://sysnet.ucsd.edu/projects/cumulus/

Deduplication
- Cumulus implementation does perform coarse-grained data deduplication
  - Recognizes duplicate data at file or 1 MB block level
  - Block boundaries for deduplication are fixed
- Deduplication only for a single client, not across clients
- Server-side support could enable deduplication across clients
  - Doesn't work well with aggregation into segments
  - Does slightly reduce privacy of backup
  - Complicates accounting
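Fixed-boundary block deduplication against a local hash database can be sketched as follows. SHA-256 and the helper names are assumptions for illustration; the example uses a 4-byte block size instead of 1 MB so that the behaviour is easy to follow.

```python
import hashlib
from typing import List, Set


def split_fixed(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into blocks at fixed offsets; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def new_blocks(data: bytes, known_hashes: Set[str], block_size: int = 1024 * 1024) -> List[bytes]:
    """Return only blocks whose hash is not yet known, recording each new hash."""
    fresh: List[bytes] = []
    for block in split_fixed(data, block_size):
        digest = hashlib.sha256(block).hexdigest()  # assumed hash function
        if digest not in known_hashes:
            known_hashes.add(digest)
            fresh.append(block)
    return fresh


hash_db: Set[str] = set()
first = new_blocks(b"aaaabbbbaaaa", hash_db, block_size=4)
shifted = new_blocks(b"xaaaabbbbaaaa", hash_db, block_size=4)
```

The first call stores two blocks because the repeated `aaaa` is recognized. The second call, with one byte inserted at the front, finds no matches at all: every fixed boundary shifts, which is the limitation the slide points to.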