Emulating Goliath Storage Systems with David



Emulating Goliath Storage Systems with David. Nitin Agrawal (NEC Labs); Leo Arulraj, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau (ADSL Lab, UW-Madison)

The Storage Researcher's Dilemma. Innovate: create the future of storage. Measure: quantify the improvement obtained. The dilemma: how do you measure the future of storage with devices from the present?

David: A Storage Emulator. Emulate large, fast, or multiple disks using a small, slow, single device. Huge disks: a ~1 TB disk using an 80 GB disk. Multiple disks: a RAID of multiple disks using RAM.

Key Idea behind David: store metadata, throw away data (and generate fake data when needed). Why is this OK? Benchmarks measure performance; many benchmarks don't care about file content, and some expect valid but not exact content.

Outline: Intro, Overview, Design, Results, Conclusion

Overview of how David works: [diagram: the benchmark runs in userspace on top of the filesystem in kernelspace; David sits beneath the filesystem as a pseudo block device driver, containing the Storage Model and the Backing Store]

Illustrative Benchmark: create a file, write a block of data, close the file; then open the file in read mode, read back the data, and close the file.
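
The same sequence as a minimal C program (the calls mirror those shown on the following slides; the file name and 4 KB block size are illustrative):

```c
#include <stdio.h>

int main(void)
{
    char buffer[4096] = { 0 };              /* one 4 KB block of data     */

    FILE *F = fopen("a.txt", "w");          /* create the file            */
    fwrite(buffer, 4096, 1, F);             /* write a block of data      */
    fclose(F);                              /* close the file             */

    F = fopen("a.txt", "r");                /* open the file in read mode */
    fread(buffer, 4096, 1, F);              /* read back the data         */
    fclose(F);                              /* close the file             */

    return 0;
}
```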

How does David handle a metadata write? The benchmark calls F = fopen("a.txt", "w"); the filesystem allocates an inode in block 100 and issues a write to LBA 100. The storage model calculates the response time for the write to LBA 100, and the metadata block is remapped to LBA 1 of the backing store (remap table entry: 100 -> 1). David responds to the filesystem after the modeled delay (here, 6 ms).

How does David handle a data write? The benchmark calls fwrite(buffer, 4096, 1, F); the filesystem issues a write to the data block at LBA 800. The storage model calculates the response time for the write to LBA 800, but the data block itself is thrown away: nothing reaches the backing store. David responds to the filesystem after the modeled delay (here, 8 ms). Space savings so far: 50%.

How does David handle a metadata read? The benchmark calls fclose(F) and then F = fopen("a.txt", "r"); the filesystem issues a read to the inode block at LBA 100. The storage model calculates the response time for the read to LBA 100; the remap table maps 100 -> 1, so the block at LBA 1 of the backing store is read and returned. David responds to the filesystem after the modeled delay (here, 3 ms).

How does David handle a data read? The benchmark calls fread(buffer, 4096, 1, F); the filesystem issues a read to the data block at LBA 800. The storage model calculates the response time for the read to LBA 800, and the block is filled with fake content by the data generator. David responds to the filesystem after the modeled delay (here, 8 ms).
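
Putting the four cases together, here is a minimal sketch of how a request is serviced. Every name below (david_request, model_response_time, remap_lookup_or_insert, and so on) is illustrative, not David's actual interface:

```c
#include <stddef.h>

struct david_request {
    unsigned long lba;        /* logical block address from the filesystem */
    int           is_write;
    char         *buf;        /* 4 KB block payload */
};

/* Hypothetical helpers, declared only for the sketch: */
long model_response_time(unsigned long lba, int is_write);
int  is_metadata(unsigned long lba);
unsigned long remap_lookup_or_insert(unsigned long lba);
void backing_store_write(unsigned long lba, const char *buf);
void backing_store_read(unsigned long lba, char *buf);
void generate_fake_content(char *buf, size_t len);
void complete_after(struct david_request *req, long delay_ms);

void david_handle(struct david_request *req)
{
    /* The storage model computes a response time for the target LBA,
     * whether or not any real I/O takes place. */
    long delay_ms = model_response_time(req->lba, req->is_write);

    if (is_metadata(req->lba)) {
        /* Metadata: remap to a compact backing-store location
         * (e.g., LBA 100 -> LBA 1), allocating a slot on first write. */
        unsigned long backing = remap_lookup_or_insert(req->lba);
        if (req->is_write)
            backing_store_write(backing, req->buf);
        else
            backing_store_read(backing, req->buf);
    } else if (!req->is_write) {
        /* Data read: the block was never stored, so fill it with
         * generated content. */
        generate_fake_content(req->buf, 4096);
    }
    /* Data write: squashed, i.e., simply thrown away. */

    complete_after(req, delay_ms);   /* respond after the modeled time */
}
```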

Outline Intro Overview Design Results Conclusion

Design Goals for David. Accurate: the emulated disk should perform similarly to the real disk. Scalable: should be able to emulate large disks. Lightweight: emulation overhead should not affect accuracy. Flexible: should be able to emulate a variety of storage devices. Adoptable: easy to install and use for benchmarking.

Components within David: Block Classifier, Storage Model, Metadata Remapper, Data Squasher, Data Generator, Backing Store

Block Classification: Data or Metadata? Distinguish data blocks from metadata blocks so that data blocks can be thrown away. Why is this difficult? David is a block-level emulator. Two approaches: Implicit Block Classification (David automatically infers the block's type) and Explicit Block Classification (the operating system passes the classification down).

Implicit Block Classification: parse metadata writes using filesystem knowledge to infer which blocks are data. Implementation for ext3: identify inode blocks using the ext3 block layout, parse inode blocks to find direct and indirect block pointers, and parse indirect blocks to find further data blocks. Problem: delay in classification.
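
A sketch of that inference chain for ext3. The structure below is a simplified view of the on-disk inode and the helper names are hypothetical; real code also reads the superblock and group descriptors and walks double/triple-indirect blocks:

```c
#include <stdint.h>

/* Simplified view of an on-disk ext3 inode: the first 12 entries of
 * i_block are direct pointers to data blocks; entry 12 points to a
 * single-indirect block (entries 13 and 14 to double/triple indirect). */
struct inode_view {
    uint32_t i_block[15];
    /* ...other on-disk fields omitted... */
};

/* Hypothetical bookkeeping helpers: */
int  lba_in_inode_table(uint32_t lba);   /* known from ext3's static layout */
void mark_as_data(uint32_t lba);         /* block may be squashed           */
void mark_as_indirect(uint32_t lba);     /* metadata; parse when written    */

/* Called on a write to an inode-table block: learn which later blocks
 * hold data and which are indirect (metadata) blocks. */
void parse_inode_block(uint32_t lba, struct inode_view *inodes, int count)
{
    if (!lba_in_inode_table(lba))
        return;

    for (int i = 0; i < count; i++) {
        for (int j = 0; j < 12; j++)
            if (inodes[i].i_block[j])
                mark_as_data(inodes[i].i_block[j]);
        if (inodes[i].i_block[12])
            mark_as_indirect(inodes[i].i_block[12]);
    }
}

/* A write to a block previously marked indirect is parsed the same way:
 * its 32-bit entries are the LBAs of further data blocks. */
```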

Ext3 Ordered Journaling Mode (without David): [diagram: metadata blocks (M) go through the journal before being checkpointed to disk; data blocks (D) are written directly to disk]

Ext3 Ordered Journaling Mode (with David): [diagram: an Unclassified Block Store temporarily holds blocks whose type is not yet known, alongside the journal and the disk]

Memory Pressure in the Unclassified Block Store: too many unclassified blocks exhaust memory. Technique: Journal Snooping parses metadata writes to the journal to infer the classification much earlier than usual.

Effect of Journal Snooping: [graph of memory used (MB) vs. time (seconds); without Journal Snooping the Unclassified Block Store grows until the system runs out of memory near 2000 MB within about 16 seconds, while with Journal Snooping memory usage stays far lower]


Explicit Block Classification: [diagram: data blocks originate in the benchmark application, metadata blocks in the filesystem; both flow down to David]. Capture page pointers to data blocks in the write system call and pass the classification information down to David.
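
A conceptual sketch of that path, assuming a hypothetical page-tagging interface rather than the actual kernel change: the write system call records which memory pages carry application data, and David later classifies each outgoing block by checking its backing page against that set.

```c
/* Conceptual sketch only: every name below is hypothetical, not the real
 * kernel modification. */
enum blk_class { BLOCK_IS_DATA, BLOCK_IS_METADATA };

void data_page_set_add(const void *page);      /* hypothetical set of      */
int  data_page_set_remove(const void *page);   /* tagged data pages;       */
                                               /* remove returns 1 if found */

/* Called from the write system call for each user-data page: */
void tag_data_pages(const void *pages[], int npages)
{
    for (int i = 0; i < npages; i++)
        data_page_set_add(pages[i]);
}

/* Called by David when a block write arrives with its backing page: */
enum blk_class classify_by_page(const void *page)
{
    if (data_page_set_remove(page))   /* page was tagged in write()    */
        return BLOCK_IS_DATA;         /* -> squash                     */
    return BLOCK_IS_METADATA;         /* -> remap into backing store   */
}
```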

Block Classification Summary. Implicit: no change to the filesystem, benchmark, or operating system, but requires filesystem knowledge (results shown with ext3). Explicit: minimal change to the operating system, works for all filesystems (results shown with btrfs).

Components within David: Block Classifier, Storage Model, Metadata Remapper, Data Squasher, Data Generator, Backing Store

David's Storage Model: [diagram: in the actual system, the benchmark and filesystem issue requests to an I/O request queue and a disk; in the emulated system, the same benchmark and filesystem issue requests to David, whose storage model stands in for the queue and the disk]

I/O Queue Model: merge sequential I/O requests to improve performance. When the I/O queue is empty, wait for 3 ms anticipating merges. When the I/O queue is full, the process is put to sleep, woken up once empty slots open up, and given a bonus for the wait period. I/O queue modeling is critical for accuracy.
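
A sketch of the queue behaviour described above; apart from the 3 ms anticipation window, the names and helpers below are placeholders, not David's real code:

```c
#define ANTICIPATION_MS 3

struct io_queue;                 /* opaque in this sketch */
struct io_request;

int  queue_is_full(struct io_queue *q);
int  queue_length(struct io_queue *q);
long sleep_until_slot_free(struct io_queue *q);    /* returns ms slept   */
void credit_bonus(struct io_request *r, long ms);  /* refund wait time   */
int  try_merge_with_neighbor(struct io_queue *q, struct io_request *r);
void insert_sorted(struct io_queue *q, struct io_request *r);
void delay_dispatch_ms(struct io_queue *q, long ms);

void submit_request(struct io_queue *q, struct io_request *req)
{
    if (queue_is_full(q)) {
        /* Process sleeps until a slot opens; the wait is credited back
         * so emulation stalls do not inflate modeled time. */
        long slept_ms = sleep_until_slot_free(q);
        credit_bonus(req, slept_ms);
    }

    /* Merge with an adjacent sequential request when possible. */
    if (!try_merge_with_neighbor(q, req))
        insert_sorted(q, req);

    /* If the queue was empty, hold dispatch for 3 ms, anticipating that
     * the next request will be sequential and mergeable. */
    if (queue_length(q) == 1)
        delay_dispatch_ms(q, ANTICIPATION_MS);
}
```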

Disk Model: a simple in-kernel disk model based on the Ruemmler and Wilkes disk model. Current models: 80 GB and 1 TB Hitachi Deskstar. The focus of our work is not disk modeling (more accurate models are possible). Disk model parameters: disk properties (rotational speed, head seek profile, etc.) and current disk state (head position, on-disk cache state, etc.).
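
A sketch of how such a model turns a request into a service time, using a Ruemmler-and-Wilkes-style decomposition into seek, rotational latency, and transfer; all parameter values are placeholders, not the Deskstar's measured profile:

```c
#include <math.h>
#include <stdlib.h>

/* Placeholder parameters (not a real drive's profile). */
#define RPM              7200.0
#define SECTORS_PER_TRK  1000.0
#define SEEK_A_MS        0.3     /* seek time ~ a*sqrt(distance) + b,     */
#define SEEK_B_MS        2.0     /*   the short-seek regime in R&W models */

static long head_cyl = 0;        /* current disk state: head position */

double service_time_ms(long cyl, double sectors)
{
    double rotation_ms = 60000.0 / RPM;            /* one full revolution   */
    long   dist        = labs(cyl - head_cyl);

    double seek_ms     = dist ? SEEK_A_MS * sqrt((double)dist) + SEEK_B_MS
                              : 0.0;
    double rot_lat_ms  = rotation_ms / 2.0;        /* average rot. latency  */
    double transfer_ms = (sectors / SECTORS_PER_TRK) * rotation_ms;

    head_cyl = cyl;                                /* update disk state     */
    return seek_ms + rot_lat_ms + transfer_ms;
}
```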

David's Storage Model Accuracy: reasonable accuracy across many workloads; many more results in the paper.

Components within David: Block Classifier, Storage Model, Metadata Remapper, Data Squasher, Data Generator, Backing Store

Backing Store: storage space for metadata blocks. Any physical storage can be used, but it must be large enough to hold all metadata blocks and fast enough to match the emulated disk. Two implementations: memory as the backing store, and a compressed disk as the backing store.

Metadata Remapper: remaps metadata blocks into a compact, compressed form. [diagram: on the emulated disk, inode blocks are interleaved with data blocks; on the compressed backing disk, only the inode blocks are kept, packed together for better performance]
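
A sketch of the remapping itself: metadata blocks receive the next free slot on the backing store in arrival order, packing them densely no matter where they sit on the emulated disk. The linear-probing table and all names here are only illustrative; David's real data structure may differ.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE  (1 << 20)
#define EMPTY       UINT64_MAX

static uint64_t keys[TABLE_SIZE];     /* emulated-disk LBA          */
static uint64_t vals[TABLE_SIZE];     /* compact backing-store LBA  */
static uint64_t next_backing_lba = 0; /* next free compact slot     */

void remap_init(void)
{
    memset(keys, 0xff, sizeof(keys)); /* mark every slot EMPTY */
}

/* Return the backing-store LBA for a metadata block, allocating the next
 * free compact slot on first use, so inode blocks scattered across the
 * emulated disk end up stored contiguously. */
uint64_t remap_lookup_or_insert(uint64_t emulated_lba)
{
    uint64_t h = (emulated_lba * 0x9E3779B97F4A7C15ULL) & (TABLE_SIZE - 1);

    while (keys[h] != EMPTY && keys[h] != emulated_lba)
        h = (h + 1) & (TABLE_SIZE - 1);   /* linear probing */

    if (keys[h] == EMPTY) {
        keys[h] = emulated_lba;
        vals[h] = next_backing_lba++;
    }
    return vals[h];
}
```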

Components within David: Block Classifier, Storage Model, Metadata Remapper, Data Squasher, Data Generator, Backing Store

Data Squasher and Generator. Data Squasher: throws away writes to data blocks. Data Generator: generates content for reads to data blocks (currently generates random content).
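
A sketch of the generator side; the slides only say the current generator produces random content, so a simple self-contained xorshift generator stands in here:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a squashed data block with pseudo-random content on a read. */
void generate_fake_content(char *buf, size_t len)
{
    static uint64_t x = 0x853c49e6748fea9bULL;   /* arbitrary non-zero seed */

    for (size_t i = 0; i < len; i++) {
        x ^= x << 13;                            /* xorshift64 step */
        x ^= x >> 7;
        x ^= x << 17;
        buf[i] = (char)(x & 0xff);
    }
}
```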

Outline: Intro, Overview, Design, Results, Conclusion

Experiments. Emulation accuracy: test emulation accuracy across benchmarks. Emulation scalability: test space savings for large-device emulation. Multiple-disk emulation: test accuracy of multiple-device emulation.

Emulation Accuracy Experiment. Experimental details: emulated a ~1 TB disk with an 80 GB disk, ran a variety of benchmarks, and validated against a real 1 TB disk.

Emulation Accuracy Results (ext3 with Implicit Block Classification): [bar chart of runtime in seconds, comparing Real vs. Emulated across benchmarks]

Emulation Accuracy Results (btrfs with Explicit Block Classification): [bar chart of runtime in seconds, comparing Real vs. Emulated across benchmarks]

Emulation Scale Experiment. Experimental details: emulated a ~1 TB disk using an 80 GB disk, created filesystem images using Impressions, and validated against a real disk.

Emulation Scale: Accuracy

Emulation Scale: Space Savings

Multiple Disks Experiment. Experimental details: emulated multiple disks using RAM, measured micro-benchmark performance on RAID-1, and validated our results against real disks.

Simple RAID-1 Emulation: [bar chart of random read/write runtime in seconds, Original vs. David, for configurations R/3, W/3, R/2, W/2, R/1, W/1]

Outline: Intro, Overview, Design, Results, Conclusion

Conclusion. David: emulate large devices with limited means. Key idea: throw away data. Results: accurate emulation of large and multiple disks. Future: emulating a storage cluster with a few machines.

Thank You www.cs.wisc.edu/adsl

Questions?

Measuring Innovation: thorough measurement is hard and costly; time, money, and effort are needed to measure performance on a variety of storage devices. Tiny benchmarks are easy to run.

Implicit Block Classification: the Unclassified Block Store. Blocks that cannot yet be classified are temporarily stored in the Unclassified Block Store, which resides in RAM. The journal checkpoint frequency determines the delay in classification. Upon classification, data blocks are squashed and metadata blocks are persisted.