Generating Realistic Impressions for File-System Benchmarking. Nitin Agrawal Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

Similar documents
Generating Realistic Impressions for File-System Benchmarking

Generating Realistic Impressions for File-System Benchmarking

Emulating Goliath Storage Systems with David

FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 23

Zettabyte Reliability with Flexible End-to-end Data Integrity

File Systems. Chapter 11, 13 OSPP

Yiying Zhang, Leo Prasath Arulraj, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. University of Wisconsin - Madison

File System Aging: Increasing the Relevance of File System Benchmarks

Warming up Storage-level Caches with Bonfire

MODERN FILESYSTEM PERFORMANCE IN LOCAL MULTI-DISK STORAGE SPACE CONFIGURATION

University of Osnabruck - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

File System: Interface and Implmentation

Designing a True Direct-Access File System with DevFS

Coerced Cache Evic-on and Discreet- Mode Journaling: Dealing with Misbehaving Disks

Main Points. File layout Directory layout

Preview. COSC350 System Software, Fall

GFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures

Segmentation with Paging. Review. Segmentation with Page (MULTICS) Segmentation with Page (MULTICS) Segmentation with Page (MULTICS)

File System Internals. Jo, Heeseung

File Systems. CS170 Fall 2018

Service and Cloud Computing Lecture 10: DFS2 Prof. George Baciu PQ838

Atari Games - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

FS Consistency & Journaling

REPRESENTATIVE, REPRODUCIBLE, AND PRACTICAL BENCHMARKING OF FILE AND STORAGE SYSTEMS. Nitin Agrawal

ECE 598 Advanced Operating Systems Lecture 14

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

File System. Preview. File Name. File Structure. File Types. File Structure. Three essential requirements for long term information storage

CS370 Operating Systems

Introduction (1 of 2) Performance and Extension of User Space File. Introduction (2 of 2) Introduction - FUSE 4/15/2014

The Google File System

TFS: A Transparent File System for Contributory Storage

Long-term Information Storage Must store large amounts of data Information stored must survive the termination of the process using it Multiple proces

File Systems. Kartik Gopalan. Chapter 4 From Tanenbaum s Modern Operating System

ECE 598 Advanced Operating Systems Lecture 18

CS 318 Principles of Operating Systems

PERSISTENCE: FSCK, JOURNALING. Shivaram Venkataraman CS 537, Spring 2019

TEFS: A Flash File System for Use on Memory Constrained Devices

Case study: ext2 FS 1

Improving Confidence in Network Simulations

Understanding Your Virtual Workload. Irfan Ahmad CTO CloudPhysics

File Systems. CS 4410 Operating Systems. [R. Agarwal, L. Alvisi, A. Bracy, M. George, E. Sirer, R. Van Renesse]

Tolera'ng File System Mistakes with EnvyFS

mode uid gid atime ctime mtime size block count reference count direct blocks (12) single indirect double indirect triple indirect mode uid gid atime

CS 550 Operating Systems Spring File System

Virtual memory Paging

The Google File System

Case study: ext2 FS 1

Computer Systems Laboratory Sungkyunkwan University

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees

There is a general need for long-term and shared data storage: Files meet these requirements The file manager or file system within the OS

File Management By : Kaushik Vaghani

Virtual Allocation: A Scheme for Flexible Storage Allocation

Enosis: Bridging the Semantic Gap between

Slacker: Fast Distribution with Lazy Docker Containers. Tyler Harter, Brandon Salmon, Rose Liu, Andrea C. Arpaci-Dusseau, Remzi H.

A study of practical deduplication

Current Topics in OS Research. So, what s hot?

University of Wisconsin-Madison

The Google File System

Towards Efficient, Portable Application-Level Consistency

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

CS106B Handout 34 Autumn 2012 November 12 th, 2012 Data Compression and Huffman Encoding

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission

File System Implementation. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

File Open, Close, and Flush Performance Issues in HDF5 Scot Breitenfeld John Mainzer Richard Warren 02/19/18

ViewBox. Integrating Local File Systems with Cloud Storage Services. Yupu Zhang +, Chris Dragga + *, Andrea Arpaci-Dusseau +, Remzi Arpaci-Dusseau +

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Google File System. Arun Sundaram Operating Systems

File Systems: Fundamentals

Introduction to MapReduce

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Ubiquitous and Mobile Computing CS 525M: Virtually Unifying Personal Storage for Fast and Pervasive Data Accesses

The Google File System

GFS: The Google File System

Outline. V Computer Systems Organization II (Honors) (Introductory Operating Systems) Advantages of Multi-level Page Tables

2. PICTURE: Cut and paste from paper

File Systems: Fundamentals

Virtual Memory. Kevin Webb Swarthmore College March 8, 2018

File Systems. File system interface (logical view) File system implementation (physical view)

CS 537 Lecture 6 Fast Translation - TLBs

Advanced Database Systems

CS November 2017

Copyright 2014 Fusion-io, Inc. All rights reserved.

CLOUD-SCALE FILE SYSTEMS

*-Box (star-box) Towards Reliability and Consistency in Dropbox-like File Synchronization Services

File Systems: Allocation Issues, Naming, and Performance CS 111. Operating Systems Peter Reiher

CS3600 SYSTEMS AND NETWORKS

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

A file system is a clearly-defined method that the computer's operating system uses to store, catalog, and retrieve files.

CS370 Operating Systems

KVZone and the Search for a Write-Optimized Key-Value Store

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines

File System Implementation

1 / 23. CS 137: File Systems. General Filesystem Design

Local File Stores. Job of a File Store. Physical Disk Layout CIS657

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission 1

File Directories Associated with any file management system and collection of files is a file directories The directory contains information about

Efficient Metadata Management in Cloud Computing

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Announcements. Persistence: Log-Structured FS (LFS)

Introduction to OS. File Management. MOS Ch. 4. Mahmoud El-Gayyar. Mahmoud El-Gayyar / Introduction to OS 1

Transcription:

Generating Realistic Impressions for File-System Benchmarking Nitin Agrawal Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

For better or for worse," benchmarks shape a field " David Patterson 2

Inputs to file-system benchmarking Application Disk layout FS logical organization File System Storage device Input: Benchmark workload Postmark, FileBench, Fstress, Bonnie, IOZone, TPCC, etc etc Input: In-memory state Cold cache/warm cache Input: File-System Image Anything goes! 3

FS images in past: use what is convenient Typical desktop file system w/ no description (SOSP 05) 5-deep tree, 5 subdirs, 10 8KB files in each (FAST 04) Randomly generated files of several MB (FAST 08) 1000 files in 10 dirs w/ random data (SOSP 03) 188GB and 129GB volumes in Engg dept (OSDI 99) 10702 files from /usr/local, size 354MB (SOSP 01) 1641 files, 109 dirs, 13.4 MB total size (OSDI 02)

Performance of find operation Relative Time Taken Disk layout & cache state File-system logical organization 5

Problem scope Characteristics of file-system images have strong impact on performance We need to incorporate representative file-system images in benchmarking & design How to create representative file-system images? 6

Requirements for creating FS images Access to data on file systems and disk layout Properties of file-system metadata [Satyanarayan81, Mullender84, Irlam93, Sienknecht94, Douceur99, Agrawal07] Disk fragmentation [Smith97] More such studies in future? A technique to create file-system images that is Representative: given a set of input distributions Controllable: supply additional user constraints Reproducible: control & report internal parameters Easy to use: for widespread adoption and consensus 7

Introducing Impressions Powerful statistical framework to generate file-system images Takes properties of file-system attributes as input Works out underlying statistical details of the image Mounted on a disk partition for real benchmarking Satisfies the four design goals Applying Impressions gives useful insights What is the impact on performance and storage size? How does an application behave on a real FS image? 8

Outline Introduction Generating realistic file-system images Applying Impressions: Desktop search Conclusion 9

Overview of Impressions Impressions 10

Properties of file-system metadata Five-year study of file-system metadata [FAST07] (Agrawal, Bolosky, Douceur, Lorch) Used as exemplar for metadata properties in Impressions 11

Features of Impressions Modes of operation for different usages Basic mode: choose default settings for parameters Advanced mode: several individually tunable knobs Thorough statistical machinery ensures accuracy Uses parameterized curve fits Allows arbitrary user constraints Built-in statistical tests for goodness-of-fit Generates namespace, metadata, file content, and disk fragmentation using above techniques 12

Creating valid metadata Creating file-system namespace Uses Generative Model proposed earlier [FAST 07] Explains the process of directory tree creation Accurately regenerates distribution of directory size and of directory depth 13

Creating namespace Cumulative Fraction of % directories of Probability of parent selection Count(subdirs)+2 Dirs Dirs by namespace by subdir count depth Directories by by Subdirectory Namespace Count Depth 100 0.18 0.16 D 0.14 90 G 0.12 80 0.1 0.08 70 0.06 Dataset 0.04 60 D 0.02 Generated G 500 00 22 44 6 8 10 12 14 16 Namespace depth (bin size 1) i Count of subdirectories Directory tree Monte Carlo run Incorporates dirs by depth and dirs by subdir count 14

Creating valid metadata Creating file-system namespace Creating files: stepwise process File size, file extension, file depth, parent directory Uses statistical models & analytical approximations 15

Example: creating realistic file sizes Contribution to used space File Sizes Lognormal Hybrid Containing file size (bytes, log scale) Pure lognormal distribution no longer good fit Hybrid model: lognormal body, Pareto tail Fits observed data more accurately, used to recreate file sizes in Impressions 16

Creating files Files by Files containing by Containing bytes Bytes Fraction of bytes 0.12 0.1 0.08 0.06 0.04 0.02 0 D G 0 8 2K 512K 512M128G File Size (bytes, log scale) S 8 S 9 S 7 S 6 S 5 S 4 S 3 S 2 S 1 i File Size Model Lognormal body, Pareto tail Captures bimodal curve 17

Creating files Top extensions Top Extensions by Count count 1 Fraction of files 0.8 0.6 0.4 0.2 0 Desired others txt null jpg htm h gif exe dll cpp others txt null jpg htm h gif exe dll cpp Generated S 9 S 8 E 9 E 8 S 7 S 6 E 7 E 6 S 5 E 5 S 4 S 3 S 2 S 1 E 4 E 3 E 2 E 1 i File Extensions Percentile values Top 20 extensions account for 50% of files and bytes 18

Creating files S 9 S 8 E 9 E 8 D 9 D 8 S 7 S 6 S 5 E 7 E 6 E 5 D 7 D 6 D 5 S 4 S 3 S 2 S 1 E 4 E 3 E 2 E 1 D 4 D 3 D 2 D 1 Mean Fraction bytes of per files file (log scale) Bytes by namespace depth Files by Files Bytes namespace by by Namespace depth Depth 0.16 0.14 D 0.12 2MB G 768KB 0.1 256KB 0.08 0.06 0.04 64KB 0.02 16KB 0 0 02 2 4 4 66 88 10 12 14 16 i Namespace depth (bin size 1) 1) File Depth Poisson Multiplicative model along with bytes by depth 19

Creating files Fraction of files Files by namespace depth 0.25 0.2 0.15 0.1 0.05 i 0 Files by Namespace Depth (with Special Directories) w/ special dirs D G 0 2 4 6 8 10 12 14 16 Namespace depth (bin size 1) Parent Dir Inverse Polynomial Satisfies distribution of dirs with file count 20

Resolving arbitrary constraints 21

Resolving arbitrary constraints Fraction of files 0.15 0.1 0.05 0 OC Original Constrained 8 2K 512K 8M File Size (bytes, log scale) Contrived for sum Constraint: Given count of files & size distribution, ensure Accurate both for the sum and the distribution sum of file sizes matches a desired total file system size 22

Resolving arbitrary constraints Arbitrarily specified on file system parameters Variant of NP-complete Subset Sum Problem Approximation algorithm based solution (in paper) Oversampling to get additional sample values Local improvement to iteratively converge to the desired sum by identifying best-fit in current sample While constraints are satisfied, constrained distribution also retains original characteristics 23

Interpolation and extrapolation Why don t we just use available data values? Limited to empirical data in input dataset What-if analysis beyond available dataset Efficient to maintain compact curve fits and use interpolation/extrapolation instead of all data Technique: Piecewise interpolation 24

Interpolation technique & accuracy 0.14 Piecewise Interpolation Piecewise Interpolation Fraction of bytes 0.12 0.1 0.08 0.06 0.04 0.02 0 Segment Value 0.06 0.04 0.02 0 Interpola9on: Seg 19 0 50 100 File System Size (GB) Segment 19 100 GB 50 GB 10 GB 0 2 8 32 128 512 2K 8K 32K 128K File Size Interpolation (75 GB) Each distribution broken down 0.12into segments Fraction of bytes File Size interpolation 75GB 0.12 0.1 0.08 0.06 0.08 Data points within a segment used for curve fit 0.04 0.02 0.02 Combine segment interpolations Extrapolated for new curve 0 R I Real 8 2K 512K 128M 32G File Size Interpolated Fraction of bytes 512K 2M 0.1 0.06 0.04 8M 0 32M 128M 512M 2G 8G 32G 128G File Size extrapolation Extrapolation (125 125GB GB) R E Real 8 2K 512K 128M 32G 25 File Size

File content Files having natural language content Word-popularity model (heavy tailed) Word-length frequency model (for the long tail) Content for other files (mp3, gif, mpeg etc) Impressions generates valid header/footer Uses third-party libraries and software 26

Disk layout and fragmentation 27

Disk layout and fragmentation Simplistic technique Layout Score for degree of fragmentation [Smith97] Pairs of file create/delete operations till desired layout score is achieved In future more File 1 nuanced ways are File possible 2 Out-of-order file writes, writes with long delays Access to file-system specific interfaces 1 non-contiguous FIPMAP in block Linux, (out XFS_IOC_GETBMAP of 8) All blocks for contiguous XFS File Layout Score = 7/8 File Layout Score = 1 (6/6) Perhaps a tool complementary to Impressions 28

Outline Introduction Generating realistic file-system images Applying Impressions: Desktop search Conclusion 29

Applying Impressions Case study: desktop search Google desktop for linux (GDL) and Beagle Metrics of interest: Size of index, time to build initial search index Identifying application bugs and policies GDL doesn t index content beyond 10-deep Computing realistic rules of thumb Overhead of metadata replication? 30

Impact of file content Index Size Comparison Index Size/FS size 0.1 0.01 Text (1 Word) Text (Model) Binary Beagle GDL File Understanding content has design: significant GDL affect: index around smaller 300% than increase Beagle for in index text files, size for larger both for GDL binary & Beagle files 31

Impact of metadata and content Relative Index Size 3.5 3 2.5 2 1.5 1 0.5 0 Original Beagle: Index Size TextCache Default Text Image Binary DisDir DisFilter Developer aid: understanding impact of different file system content & different index schemes 32

Impact of metadata and content Relative Index Size 3.5 3 2.5 2 1.5 1 0.5 0 Original Beagle: Index Size TextCache Beagle Default Text Image Binary DisDir DisFilter Future App Reproducing identical file-system image to compare other apps or ones developed later 33

Conclusion Impressions framework for realistic FS images Representative, controllable, reproducible, easy to use Includes almost all file system params of interest Extensible architecture Plug in new statistical constructs, new models for metadata and content generation Powerful utility for file-system benchmarking To be contributed publicly (coming soon) http://www.cs.wisc.edu/adsl/software/impressions 34

Questions? Nitin Agrawal http://www.cs.wisc.edu/~nitina Impressions download (coming soon) http://www.cs.wisc.edu/adsl/software/impressions i 35