The Effectiveness of Deduplication on Virtual Machine Disk Images

Keren Jin & Ethan L. Miller
Storage Systems Research Center
University of California, Santa Cruz

Motivation
- Virtualization is widely deployed for multiplexing services
- Increases the demand for storage on a single node
- VM disk images have a lot of pages in common
- Binaries shared between disk images
- VM doesn't know the internal structure of the file system on the virtual disk
- Deduplication can find identical chunks in different VM disk images
- Storage savings can range from 10% to 80% or more
- Which factors affect deduplication ratios?

Methodology overview
- VMs were downloaded from the Internet
- Used freely available images
- Wide range of pre-built functionality
- VM disk images were chunked
- Produced chunk lists for each image
- Deduplication experiments were run against sets of chunk lists
- Determined deduplication ratio and amounts of sharing
- Chunk-wise compression: deduplication + zip

Chunking
[Figure: fixed-size vs. variable-size chunking]
- Break VM disk images into chunks
- Fixed-size chunking: constant chunk size
- Variable-size chunking: adjust Rabin fingerprint parameters to obtain the desired size
- Use a secure hash of the chunk content as the chunk ID
- Zero-filled chunks are all identical
- Generate a sorted list of chunks (easy to merge)
- A chunk store consists of chunk IDs from a group of VMs
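The chunk-list pipeline on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-1 stands in for the unspecified "secure hash", the 4 KB chunk size is one of the sizes mentioned in the talk, and the helper names (`fixed_size_chunks`, `chunk_list`, `chunk_store`) are hypothetical.

```python
import hashlib
import heapq

CHUNK_SIZE = 4096  # assumption: one of the chunk sizes used in the talk


def fixed_size_chunks(data, size=CHUNK_SIZE):
    """Split a byte string into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_id(chunk):
    """Chunk ID = secure hash of the content (SHA-1 is an assumption here)."""
    return hashlib.sha1(chunk).hexdigest()


def chunk_list(image):
    """Sorted list of (chunk ID, chunk length) pairs for one disk image."""
    return sorted((chunk_id(c), len(c)) for c in fixed_size_chunks(image))


def chunk_store(chunk_lists):
    """Merge already-sorted per-image chunk lists into one sorted store."""
    return list(heapq.merge(*chunk_lists))


# Two toy "images": zero-filled chunks hash to the same ID in both.
image_a = b"\x00" * 8192 + b"A" * 4096
image_b = b"\x00" * 4096 + b"B" * 4096
store = chunk_store([chunk_list(image_a), chunk_list(image_b)])
unique_ids = {cid for cid, _ in store}
print(len(store), "chunks,", len(unique_ids), "unique")
```

Because each per-image list is sorted by ID, building a store for any group of VMs is a linear merge, which is presumably what "easy to merge" refers to.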

Chunking VM disk images
[Figure: image blocks in the order 0 1 2 3 4 5 6 7, then rearranged as 0 1 7 2 3 4 5 6]
- Disk image may be split into multiple image files
- Some images use files 2 GB long
- Fixed-size chunking has a boundary-shifting problem
- Adding a small amount of data may shift content
- Chunk size is large (4 KB): a small amount of data may be a disk block
- Chunk each image file separately, as each image file has a VMM-dependent header
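A small experiment makes the boundary-shifting problem concrete. The sketch below is an illustration, not the authors' tooling: it chunks random data at 4 KB, then compares inserting a single byte with inserting a whole 4 KB block. SHA-1 and the helper names are assumptions.

```python
import hashlib
import random

BLOCK = 4096  # chunk size and assumed disk block size


def chunk_ids(data, size=BLOCK):
    """Set of SHA-1 digests of the fixed-size chunks of `data`."""
    return {hashlib.sha1(data[i:i + size]).digest()
            for i in range(0, len(data), size)}


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK))


def shared_after_insert(extra):
    """Insert `extra` after the first block; count chunks still shared."""
    modified = original[:BLOCK] + extra + original[BLOCK:]
    return len(chunk_ids(original) & chunk_ids(modified))


one_byte = shared_after_insert(b"x")           # shifts every later chunk
one_block = shared_after_insert(b"x" * BLOCK)  # keeps chunk alignment
print(one_byte, one_block)
```

With a one-byte insertion only the chunk before the insertion point survives; when the inserted data is a whole block, every original chunk is still found, which is why block-aligned changes inside a disk image are much less harmful to fixed-size chunking.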

Deduplication categories
- Find identical chunk IDs in a chunk store
- Deduplication ratio = 1 - stored_bytes / original_bytes
- Chunk categories in the example: stored 14/23; intra 3/23; inter 4/23; intra-inter 2/23
[Figure: two rows of chunk IDs]
  F9 50 F9 8A 73 42 91 F9 8A 45 1F
  55 8A 21 42 79 E4 80 73 11 E4 45 1F
[Legend: Stored; Sharing: Intra, Inter, Intra-inter, None]
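The category counts on this slide can be reproduced from the two rows of chunk IDs. The sketch below assumes that each row is one disk image, that all chunks have equal size, and that a duplicate is classified by where its chunk ID occurs: only within one image (intra), once in each of several images (inter), or both repeated within an image and present in another (intra-inter). That rule is inferred from the numbers, not stated in the slides.

```python
from collections import Counter

# Chunk IDs from the slide's figure; treating each row as one image
# is an assumption.
vm1 = "F9 50 F9 8A 73 42 91 F9 8A 45 1F".split()
vm2 = "55 8A 21 42 79 E4 80 73 11 E4 45 1F".split()


def categorize(images):
    """Count stored chunks and duplicate chunks by sharing category."""
    total = Counter()      # occurrences of each ID across all images
    owners = {}            # ID -> set of image indexes containing it
    max_in_one = {}        # ID -> largest count inside a single image
    for idx, ids in enumerate(images):
        for cid, n in Counter(ids).items():
            total[cid] += n
            owners.setdefault(cid, set()).add(idx)
            max_in_one[cid] = max(max_in_one.get(cid, 0), n)

    result = Counter()
    for cid, n in total.items():
        result["stored"] += 1          # first copy is always stored
        duplicates = n - 1
        if duplicates == 0:
            continue
        across = len(owners[cid]) > 1
        within = max_in_one[cid] > 1
        if across and within:
            result["intra-inter"] += duplicates
        elif across:
            result["inter"] += duplicates
        else:
            result["intra"] += duplicates
    return result


cats = categorize([vm1, vm2])
n_chunks = len(vm1) + len(vm2)
ratio = 1 - cats["stored"] / n_chunks  # equal-size chunks assumed
print(dict(cats), round(ratio, 3))
```

Under these assumptions the result matches the slide: 14 stored, 3 intra, 4 inter and 2 intra-inter out of 23 chunks, for a deduplication ratio of 9/23, or about 39%.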

Homogeneous and heterogeneous VMs
- Effect of VM similarity on deduplication ratio
- Data sets: 14 Ubuntu 8.04 LTS instances; 13 various Unix and Linux instances
- Variable-size chunking: 1 KB
- Similar VMs: deduplication is very effective! Slow marginal rate of increase
- Dissimilar VMs: less effective (but still helps!). Marginal rate is higher
- Larger chunk size produces similar ratio and category distributions
[Figure panels: Similar VMs; Dissimilar VMs]

Chunk categories in different operating systems
- More intra-inter sharing in Linux than BSD
- Linux is more homogeneous
- Larger chunk sizes do not impact sharing ratio much
- Many small zero chunks in Linux
- Intrinsic zero chunks due to file system or data files
[Figure: operating system groups; variable-length chunks, avg. 512 B]

Chunk count distribution
- CDF of chunks by count and total size from the Linux chunk store
- Space utilization: 70% of chunks are unique; 20% of space is zero-filled chunks
- Chunk reuse: most chunks are used only a few times

Effect of chunk size
- Smaller chunk sizes: higher deduplication ratios
- Less chance of the avalanche effect from rearranging chunks
- Fixed-size chunking performs well
- Much easier to implement in an online system
- Small zero chunks: caused by guest OS and apps
- Empty disk space would cause larger chunks
[Figure: Ubuntu Server 6.10, 7.04, 7.10 and 8.04; fixed- and variable-size chunking at 512, 1024, 2048 and 4096 bytes]
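For comparison with fixed-size chunking, here is a minimal sketch of variable-size (content-defined) chunking. It is not the Rabin fingerprinting used in the study: a simpler "gear" rolling hash is swapped in, and the table seed, `mask_bits`, `min_size` and `max_size` are all illustrative assumptions. Raising `mask_bits` by one roughly doubles the average chunk size, which plays the role of "adjusting the parameters to obtain the desired size".

```python
import hashlib
import random

# Gear table: one random 32-bit value per byte value (seed is arbitrary).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data, mask_bits=10, min_size=64, max_size=8192):
    """Cut a chunk where the top `mask_bits` bits of the rolling hash are 0.

    The hash depends only on the last 32 bytes, so cut points follow the
    content rather than absolute offsets.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        at_cut_point = (h >> (32 - mask_bits)) == 0
        if (size >= min_size and at_cut_point) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def id_set(chunks):
    return {hashlib.sha1(c).digest() for c in chunks}


data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(64 * 1024))
modified = original[:100] + b"!" + original[100:]  # one-byte insertion

orig_ids = id_set(cdc_chunks(original))
shared = len(orig_ids & id_set(cdc_chunks(modified)))
print(len(orig_ids), "chunks,", shared, "still shared after the insertion")
```

Unlike fixed-size chunking, the cut points resynchronize shortly after the inserted byte, so most chunk IDs survive. The cost is a per-byte hash computation, which is part of why fixed-size chunking is described as easier to implement in an online system.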

Effect of OS version
- OS versions closer together deduplicate better
- Consecutive releases have higher sharing
- Still a high degree of deduplication even between non-consecutive releases
- Mature operating systems change little across versions
[Figure: Ubuntu Server and Fedora, various versions; fixed- and variable-size chunking, 512 bytes]

Effects of locale on deduplication
- Different locales deduplicate very well
- Code remains the same
- Distributions often include files for all languages
- Config files determining locale simply select the right files for the language
- VM instances for different locales are highly similar
[Figure: Ubuntu and Fedora server; English & French]

Effects of OS lineage
- Deduplicate a wide range of distributions from two lineages: Debian and Red Hat
- Result: deduplication isn't as effective as for other cases
- Source code may be similar
- Binaries differ significantly between distributions

Virtual Machine Managers: VMware vs. VirtualBox
- Ubuntu Server 8.04.1 images in VMware & VirtualBox
  - Sparse images
  - Flat images
- Different VMMs generate differently aligned headers
- Actual guest data are duplicated
- Good dedup effectiveness for both sparse and flat VM disks
[Figure panels: Sparse disk images; Flat disk images]

Result: package installation
- Install two package sets in common application areas
- Vary installation order
- Dedup pairs of images
- Installation order has little effect on deduplication ratio
- Different package sets don't have much new data
- Common dependencies appear in both installations

Packaging systems' impact on deduplication effectiveness
- Compare packaging systems: install in the same order for each experiment
  - deb (Ubuntu) vs. rpm (CentOS)
  - rpm (Fedora) vs. rpm (CentOS)
- Relatively little sharing: OS difference overwhelms package manager similarity

Does package removal matter?
- Removed original packages
- Removed packages typically leave data in image
- Image with package removed resembles image with package installed
- However, reverted image differs from original image
- Don't bother removing packages from deduplicated VM disks...

How effective is chunk compression?
- Compress with zip after deduplication
- Reduces space by 40%
- Compression level matters little: small window size
- Larger chunks: higher compression
- Tradeoff: longer time to compress and decompress
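The chunk-size effect on compression is easy to observe with Python's `zlib` module, which implements the DEFLATE algorithm commonly used by zip. The sample data and helper name below are made up for illustration; real disk-image chunks will behave differently in detail.

```python
import zlib


def compressed_size(data, chunk_size, level=6):
    """Total size when each chunk is compressed independently."""
    return sum(len(zlib.compress(data[i:i + chunk_size], level))
               for i in range(0, len(data), chunk_size))


# Synthetic, mildly repetitive sample standing in for deduplicated chunks.
sample = b"".join(
    b"/usr/lib/module_%04d.so: symbol table entry %d\n" % (i, i * 7)
    for i in range(4000)
)

sizes = {size: compressed_size(sample, size)
         for size in (512, 1024, 2048, 4096)}
for size, total in sizes.items():
    print(size, total, round(total / len(sample), 3))
```

Each independently compressed chunk pays a fixed header cost and starts with an empty history window, so small chunks give the compressor little to work with. That is also consistent with the slide's remark that the compression level matters little.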

Ongoing work: spatial locality
- A spatial locality occurs if the same group of adjacent chunks appears multiple times in a chunk store
- Don't count sub-localities if covered by super-localities
- Methodology:
  - Generate every possible locality by concatenating adjacent chunk IDs for a specified length
  - Deduplicate
  - Remove covered sub-localities
- Detecting ALL localities is hard
- Detecting certain lengths of localities is more tractable
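The methodology above can be sketched for a bounded locality length. This is one possible reading, not the authors' implementation: a run counts as "covered" if it is contained in a longer repeated run that was already found, even if it also occurs elsewhere. The function names and the toy chunk sequence are hypothetical.

```python
from collections import Counter


def repeated_runs(ids, length):
    """Runs of `length` adjacent chunk IDs that occur more than once."""
    counts = Counter(tuple(ids[i:i + length])
                     for i in range(len(ids) - length + 1))
    return {run for run, n in counts.items() if n > 1}


def is_subrun(short, long):
    """True if `short` appears as a contiguous slice of `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def localities(ids, max_length):
    """Repeated runs up to `max_length`, dropping covered sub-localities."""
    found = set()
    # Longest first, so super-localities are known before their sub-runs.
    for length in range(max_length, 1, -1):
        for run in repeated_runs(ids, length):
            if not any(is_subrun(run, longer) for longer in found):
                found.add(run)
    return found


chunk_ids = "A B C D X A B C D Y E F Z E F".split()
result = localities(chunk_ids, max_length=4)
print(sorted(result))
```

For the toy sequence the sketch reports the runs `A B C D` and `E F`; the shorter runs `A B C`, `B C D`, `A B`, `B C` and `C D` are dropped as covered. Bounding `max_length` keeps the work proportional to the number of lengths tried, in line with the observation that fixed lengths are more tractable than finding all localities.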

Result: spatial locality
- Number of localities decreases sharply as locality length increases
- Long localities exist while their immediate predecessors and successors do not
- Different VM groups might have different locality patterns
- Working on showing this...
[Figure: Ubuntu JeOS 8.04 (one instance); fixed-size chunking, 1 KB]

Future work
- Relationship between deduplication ratio and user-related actions
- Examine disk images after users have configured them
- Expect less deduplication, but still effective, especially for similar operating systems
- Explore locality issues: will deduplication hurt sequential I/O?
- Implement deduplication on the fly in the VM manager

Conclusions
- Deduplication is efficient for virtual machines, often saving 50% to 80%
- More virtual machine instances: more savings
- More homogeneity: more savings
- Fixed-size chunking works well!
- Deduplication works well across many differences
  - OS version
  - Package installations
  - Locale
- Other differences reduce effectiveness more, but deduplication is still effective
- Integration with conventional data compression works, but is limited by relatively small chunk sizes

Thanks! Questions?
Research supported in part by Symantec & UC Micro