Generating Realistic Impressions for File-System Benchmarking
Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
"For better or for worse, benchmarks shape a field" - David Patterson
Inputs to file-system benchmarking
- Benchmark workload: Postmark, FileBench, Fstress, Bonnie, IOZone, TPC-C, etc.
- In-memory state: cold cache / warm cache
- File-system image: anything goes!
(Figure: storage stack - application atop the file system atop the storage device, with disk layout and FS logical organization)
FS images in the past: use what is convenient
- Typical desktop file system w/ no description (SOSP 05)
- 5-deep tree, 5 subdirs, 10 8KB files in each (FAST 04)
- Randomly generated files of several MB (FAST 08)
- 1000 files in 10 dirs w/ random data (SOSP 03)
- 188GB and 129GB volumes in an Engineering dept (OSDI 99)
- 10702 files from /usr/local, size 354MB (SOSP 01)
- 1641 files, 109 dirs, 13.4MB total size (OSDI 02)
Performance of the find operation
(Figure: relative time taken under varying disk layout & cache state and varying file-system logical organization)
Problem scope
- Characteristics of file-system images have a strong impact on performance
- We need to incorporate representative file-system images in benchmarking & design
- How to create representative file-system images?
Requirements for creating FS images
- Access to data on file systems and disk layout
  - Properties of file-system metadata [Satyanarayan81, Mullender84, Irlam93, Sienknecht94, Douceur99, Agrawal07]
  - Disk fragmentation [Smith97]
  - More such studies in the future?
- A technique to create file-system images that is
  - Representative: given a set of input distributions
  - Controllable: supply additional user constraints
  - Reproducible: control & report internal parameters
  - Easy to use: for widespread adoption and consensus
Introducing Impressions
- Powerful statistical framework to generate file-system images
  - Takes properties of file-system attributes as input
  - Works out underlying statistical details of the image
  - Mounted on a disk partition for real benchmarking
- Satisfies the four design goals
- Applying Impressions gives useful insights
  - What is the impact on performance and storage size?
  - How does an application behave on a real FS image?
Outline
- Introduction
- Generating realistic file-system images
- Applying Impressions: desktop search
- Conclusion
Overview of Impressions
(Figure: Impressions architecture)
Properties of file-system metadata
- Five-year study of file-system metadata [FAST07] (Agrawal, Bolosky, Douceur, Lorch)
- Used as the exemplar for metadata properties in Impressions
Features of Impressions
- Modes of operation for different usages
  - Basic mode: choose default settings for parameters
  - Advanced mode: several individually tunable knobs
- Thorough statistical machinery ensures accuracy
  - Uses parameterized curve fits
  - Allows arbitrary user constraints
  - Built-in statistical tests for goodness-of-fit
- Generates namespace, metadata, file content, and disk fragmentation using the above techniques
Creating valid metadata
- Creating file-system namespace
  - Uses the generative model proposed earlier [FAST 07]
  - Explains the process of directory tree creation
  - Accurately regenerates the distributions of directory size and directory depth
Creating namespace
- Directory tree grown via a Monte Carlo run
- Probability of parent selection proportional to Count(subdirs) + 2
- Incorporates dirs by depth and dirs by subdir count
(Figures: directories by subdirectory count and directories by namespace depth, dataset vs. generated)
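The Monte Carlo growth step above can be sketched in a few lines: each new directory picks its parent with probability proportional to the parent's current subdirectory count plus two. This is a minimal illustration of the cited generative model, not the paper's implementation; function names and the seed handling are our own.

```python
import random

def generate_tree(num_dirs, seed=None):
    """Grow a directory tree one node at a time (Monte Carlo).

    Each new directory selects its parent with probability
    proportional to (parent's current subdirectory count + 2),
    following the generative model cited from the FAST '07 study.
    Returns parent[i] = index of directory i's parent (None for root).
    """
    rng = random.Random(seed)
    parent = [None]   # root has no parent
    subdirs = [0]     # current subdirectory count per directory
    for i in range(1, num_dirs):
        weights = [c + 2 for c in subdirs]
        p = rng.choices(range(i), weights=weights)[0]
        parent.append(p)
        subdirs[p] += 1
        subdirs.append(0)
    return parent

def depth(parent, i):
    """Namespace depth of directory i (root = 0)."""
    d = 0
    while parent[i] is not None:
        i = parent[i]
        d += 1
    return d
```

The "+2" keeps freshly created, childless directories eligible as parents while still favoring already-bushy directories, which is what reproduces the observed subdirectory-count distribution.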
Creating valid metadata
- Creating file-system namespace
- Creating files: stepwise process
  - File size, file extension, file depth, parent directory
  - Uses statistical models & analytical approximations
Example: creating realistic file sizes
- A pure lognormal distribution is no longer a good fit
- Hybrid model: lognormal body, Pareto tail
- Fits observed data more accurately; used to recreate file sizes in Impressions
(Figure: contribution to used space by containing file size, lognormal vs. hybrid fit)
Creating files: file size
- Model: lognormal body, Pareto tail
- Captures the bimodal curve of files by containing bytes
(Figure: fraction of bytes by file size, dataset vs. generated)
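A hybrid body-plus-tail sampler can be sketched as below. The distribution shape (lognormal body, Pareto tail) is from the slides; every numeric parameter here (mu, sigma, tail cutoff, alpha, tail probability) is an illustrative placeholder, not a fitted value from the study.

```python
import random

def sample_file_size(rng, mu=8.5, sigma=2.3, tail_cutoff=2**25,
                     alpha=1.1, tail_prob=0.01):
    """Sample one file size in bytes from a hybrid distribution:
    lognormal body with a Pareto tail. Parameter values are
    illustrative placeholders, not the paper's fitted values."""
    if rng.random() < tail_prob:
        # Pareto tail: inverse-CDF sampling for the few huge files
        u = rng.random()
        return int(tail_cutoff * (1.0 - u) ** (-1.0 / alpha))
    # lognormal body covers the mass of small and medium files
    return max(1, int(rng.lognormvariate(mu, sigma)))
```

Mixing the two components this way lets the body match the count-weighted size distribution while the heavy tail supplies the handful of very large files that dominate bytes used.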
Creating files: file extensions
- Modeled via percentile values over extension popularity
- Top 20 extensions account for 50% of files and bytes
(Figure: fraction of files for the top extensions by count - cpp, dll, exe, gif, h, htm, jpg, null, txt, others - desired vs. generated)
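Drawing extensions to match a popularity distribution reduces to weighted sampling. A minimal sketch follows; the weight table is made up for illustration and does not reproduce the study's measured fractions.

```python
import random

def make_extension_sampler(counts):
    """Build a weighted sampler over file extensions from
    observed popularity counts."""
    exts, weights = zip(*counts.items())
    def sample(rng):
        return rng.choices(exts, weights=weights)[0]
    return sample

# Hypothetical popularity weights ("null" = files with no extension)
sampler = make_extension_sampler(
    {"txt": 10, "jpg": 8, "h": 6, "gif": 5, "dll": 5,
     "exe": 4, "htm": 4, "cpp": 3, "null": 3, "other": 12})
```

Since the top 20 extensions cover only half of files and bytes, a catch-all bucket like "other" is needed so the generated image does not over-concentrate on the named extensions.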
Creating files: file depth
- Poisson model, with a multiplicative model for bytes by depth
(Figures: fraction of files by namespace depth, and mean bytes per file by namespace depth, dataset vs. generated)
Creating files: parent directory
- Inverse polynomial model
- Satisfies the distribution of directories by file count
(Figure: fraction of files by namespace depth, with special directories, dataset vs. generated)
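The depth and parent-directory steps can be sketched together: draw a target depth from a Poisson, then pick a directory near that depth. This is a simplified stand-in for the slides' Poisson and inverse-polynomial models; the lambda value and uniform choice within a depth are assumptions for illustration.

```python
import random, math

def sample_poisson(rng, lam):
    """Knuth's method: multiply uniforms until the product
    drops below exp(-lam)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def place_file(rng, dirs_by_depth, lam=4.0):
    """Pick a parent directory for a new file: draw a target
    namespace depth from a Poisson (lambda is illustrative),
    then choose among directories at the nearest available
    depth. A simplification of the slides' depth and
    parent-selection models."""
    d = sample_poisson(rng, lam)
    nearest = min(dirs_by_depth, key=lambda x: abs(x - d))
    return rng.choice(dirs_by_depth[nearest])
```

A fuller version would bias the within-depth choice (e.g., by current file count) rather than choosing uniformly, which is where the inverse-polynomial model comes in.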
Resolving arbitrary constraints
Resolving arbitrary constraints
- Constraint: given a count of files & a size distribution, ensure the sum of file sizes matches a desired total file-system size
- Accurate both for the sum and for the distribution
(Figure: original vs. constrained file-size distributions, contrived for sum)
Resolving arbitrary constraints
- Constraints arbitrarily specified on file-system parameters
- Variant of the NP-complete Subset Sum Problem
- Solution based on an approximation algorithm (in paper)
  - Oversampling to get additional sample values
  - Local improvement to iteratively converge to the desired sum by identifying the best fit in the current sample
- While constraints are satisfied, the constrained distribution also retains the original characteristics
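The oversample-then-improve idea can be sketched as follows: draw more samples than needed, keep n of them, and repeatedly swap a kept value for a spare one when that shrinks the gap to the target sum. This is a simplified sketch of the approach the slides describe, not the paper's algorithm; the oversampling factor and iteration count are arbitrary.

```python
import random

def match_total(sample_fn, n, target, rng, oversample=1.5, iters=2000):
    """Draw ~n*oversample values from sample_fn, keep n, then
    iteratively swap a kept value for the spare value that best
    cancels the remaining error sum(kept) - target. A sketch of
    the oversampling + local-improvement strategy for this
    subset-sum-like problem."""
    pool = [sample_fn(rng) for _ in range(int(n * oversample))]
    kept, spare = pool[:n], pool[n:]
    err = sum(kept) - target
    for _ in range(iters):
        if not spare or err == 0:
            break
        i = rng.randrange(n)
        # spare value whose swap-in best cancels the current error
        j = min(range(len(spare)),
                key=lambda j: abs(err - kept[i] + spare[j]))
        new_err = err - kept[i] + spare[j]
        if abs(new_err) < abs(err):   # accept only strict improvement
            kept[i], spare[j] = spare[j], kept[i]
            err = new_err
    return kept
```

Because every accepted swap replaces one sampled value with another sampled value, the kept set still consists entirely of draws from the original distribution, which is why the constrained result retains its shape.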
Interpolation and extrapolation
- Why don't we just use available data values?
  - Limited to empirical data in the input dataset
  - What-if analysis goes beyond the available dataset
  - More efficient to maintain compact curve fits and use interpolation/extrapolation instead of all the data
- Technique: piecewise interpolation
Interpolation technique & accuracy
- Each distribution is broken down into segments
- Data points within a segment are used for a curve fit
- Segment interpolations are combined into the new curve
(Figures: per-segment value vs. file-system size at 10/50/100 GB; file-size distribution interpolated at 75GB and extrapolated at 125GB, real vs. generated)
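Per-segment interpolation can be sketched with simple linear interpolation between the file-system sizes we have curves for; the same formula extrapolates beyond the last known size. The paper fits curves per segment; plain linear interpolation here is a simplifying assumption.

```python
def interpolate_segment(fs_sizes, seg_values, target_size):
    """Linearly interpolate (or extrapolate) one histogram
    segment's value at a new file-system size, given its values
    at known sizes. Repeating this for every segment and
    recombining yields the full distribution at the new size:
    a simplified take on piecewise interpolation."""
    pts = sorted(zip(fs_sizes, seg_values))
    # find the bracketing pair of known sizes (falls through to
    # the last pair when target_size lies beyond them)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if target_size <= x1:
            break
    t = (target_size - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)
```

For example, with a segment worth 0.02, 0.04, and 0.06 of bytes at 10, 50, and 100 GB, a 75 GB image interpolates to 0.05, and a 125 GB image extrapolates to 0.07.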
File content
- Files with natural-language content
  - Word-popularity model (heavy tailed)
  - Word-length frequency model (for the long tail)
- Content for other files (mp3, gif, mpeg, etc.)
  - Impressions generates valid headers/footers
  - Uses third-party libraries and software
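A heavy-tailed word-popularity model can be sketched with Zipf-style weights: the word of rank r is drawn with weight 1/r^s. The synthetic word names ("w1", "w2", ...) and the exponent are illustrative assumptions; the actual model also accounts for word-length frequency, which this sketch omits.

```python
import random

def generate_text(rng, vocab_size=10000, words=100, s=1.0):
    """Emit natural-language-like content using a Zipf-style
    word-popularity model: word ranked r is drawn with weight
    1 / r**s. Word identities are synthetic placeholders."""
    ranks = list(range(1, vocab_size + 1))
    weights = [1.0 / (r ** s) for r in ranks]
    picks = rng.choices(ranks, weights=weights, k=words)
    return " ".join(f"w{r}" for r in picks)
```

Heavy-tailed content matters for benchmarking indexers: a few words dominate occurrences while the long tail of rare words dominates vocabulary size, and both effects shape index size.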
Disk layout and fragmentation
Disk layout and fragmentation
- Simplistic technique
  - Layout score for degree of fragmentation [Smith97]
  - Pairs of file create/delete operations until the desired layout score is achieved
  - Example: 1 non-contiguous block out of 8 gives layout score 7/8; all blocks contiguous gives layout score 1 (6/6)
- In the future, more nuanced approaches are possible
  - Out-of-order file writes, writes with long delays
  - Access to file-system-specific interfaces: FIBMAP in Linux, XFS_IOC_GETBMAP for XFS
  - Perhaps a tool complementary to Impressions
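Given a file's physical block numbers (as an interface like Linux's FIBMAP would report), the layout score follows the slide's examples: the fraction of blocks adjacent to their predecessor, with the first block counted as contiguous. A minimal sketch:

```python
def layout_score(blocks):
    """Layout score [Smith97] per the slide's examples: the
    fraction of a file's blocks that are contiguous, counting
    a block as contiguous when it directly follows the previous
    block (the first block counts by convention). Block numbers
    would come from an interface such as FIBMAP on Linux."""
    if not blocks:
        return 1.0
    contig = 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b == a + 1)
    return contig / len(blocks)
```

So an 8-block file with one break (e.g., blocks 10-13 then 20-23) scores 7/8, and a fully contiguous 6-block file scores 6/6 = 1, matching the slide.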
Outline
- Introduction
- Generating realistic file-system images
- Applying Impressions: desktop search
- Conclusion
Applying Impressions
- Case study: desktop search
  - Google Desktop for Linux (GDL) and Beagle
  - Metrics of interest: size of index, time to build the initial search index
- Identifying application bugs and policies
  - GDL doesn't index content beyond 10-deep
- Computing realistic rules of thumb
  - Overhead of metadata replication?
Impact of file content
- File content has a significant effect: around 300% increase in index size for both GDL & Beagle
- Understanding content design: GDL index smaller than Beagle for text files, larger for binary files
(Figure: index size / FS size for text (1 word), text (model), and binary content, Beagle vs. GDL)
Impact of metadata and content
- Developer aid: understanding the impact of different file-system content & different indexing schemes
(Figure: relative Beagle index size for Original, TextCache, DisDir, and DisFilter schemes under Default, Text, Image, and Binary content)
Impact of metadata and content
- Reproducing an identical file-system image to compare other apps, or ones developed later
(Figure: relative index size, Beagle vs. a future app, on the same image)
Conclusion
- Impressions: a framework for realistic FS images
  - Representative, controllable, reproducible, easy to use
  - Includes almost all file-system parameters of interest
- Extensible architecture
  - Plug in new statistical constructs, new models for metadata and content generation
- Powerful utility for file-system benchmarking
- To be contributed publicly (coming soon): http://www.cs.wisc.edu/adsl/software/impressions
Questions?
Nitin Agrawal: http://www.cs.wisc.edu/~nitina
Impressions download (coming soon): http://www.cs.wisc.edu/adsl/software/impressions