Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL

Ram Kesavan, Technical Director, WAFL, NetApp, Inc. SDC 2017

Outline
- Garbage collection in WAFL: USENIX FAST 2017; ACM Transactions on Storage (coming soon)
- Selected topics in WAFL: recently published papers

WAFL: 25+ Years!
- Technology trends: MHz -> GHz cores, 1 core -> 40+ cores; HDDs (5K -> 15K rpm, MB -> TB), SSDs (100 GiB -> 32+ TiB), cloud, SCM; max file system size: 28 GiB -> 400 TiB
- Applications: NFS/SMB file sharing -> LUNs with FC/iSCSI access; virtualization over block & file access
- Many data management features: snapshots, clones, replication, dedupe, compression, mobility, encryption, etc.
- Mixed workload deployment: an ONTAP node hosts 100s of WAFL file systems, 100s of TiB, 1000s of LUNs, etc.

Free Space Metadata
- Reasons for tracking free space: find & allocate blocks for new writes; report usage for provisioning/purchasing decisions
- Design dimensions: in-memory vs. persistent; lazy vs. contemporaneous
- Consistent performance is key: balanced resource usage between user ops and free space management (& other backend work) across CPU, I/O, memory, etc.

WAFL: Layout & Transactional Model
- The file system is a tree of blocks; everything is a file, including metadata
- Log-structured & copy-on-write: each dirty buffer is written to a newly allocated block, except the superblock
- Large atomic transactions, called consistency points (CPs): ~GB worth of dirty buffers, ~2 s to 5 s long; the persistent file system is always self-consistent

Atomic Update of the Persistent File System
[Diagram: the block trees of CP i and CP i+1, each rooted at its superblock; the blocks modified in CP i+1 (B and E) are written to newly allocated locations.]
- 1 GB/s of writes needs ~1 GB/s of allocations of new blocks and ~1 GB/s of frees of unused blocks
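
To make the copy-on-write rule concrete, here is a minimal sketch (all structure and function names are invented, not WAFL source): writing a dirty buffer allocates a new block, repoints the in-memory node at it, and queues the old block number to be freed, while the superblock alone is overwritten in place at the end of the CP.

    /* Sketch of copy-on-write block update during a consistency point.
     * All names and structures here are illustrative, not WAFL's. */
    #include <stdio.h>

    #define NO_BLOCK (-1)

    static int next_free = 100;        /* trivial bump allocator               */
    static int free_queue[16];         /* old blocks to reclaim after the CP   */
    static int nfree;

    struct node {
        int blkno;                     /* where this node currently lives on disk */
        int dirty;
    };

    static void write_dirty(struct node *n)
    {
        if (!n->dirty)
            return;
        if (n->blkno != NO_BLOCK)
            free_queue[nfree++] = n->blkno;   /* old location becomes free...        */
        n->blkno = next_free++;               /* ...and the data goes to a new block */
        n->dirty = 0;
        /* The superblock is the one exception: it is overwritten in place,
         * after every other dirty block of the CP has reached disk. */
    }

    int main(void)
    {
        struct node b = { 7, 1 };
        write_dirty(&b);
        printf("node rewritten to block %d; block %d queued for free\n",
               b.blkno, free_queue[0]);
        return 0;
    }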

Activemap
- Bits: used (1) -> free (0)
- Long CPs due to: random frees -> more dirty activemap blocks (see the sketch below); long activemap chains
- Long CPs hurt write throughput
[Diagram: four activemap blocks A-D of CP i form a chain; a bit in block A is cleared in CP i+1. Legend: free block, used block, used block in chain.]
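
The dirtying problem can be shown with a toy activemap; the sketch below (sizes and names are assumptions, not WAFL internals) frees the same number of blocks sequentially and scattered, and counts how many 4 KiB activemap blocks each pattern dirties.

    /* Minimal sketch (illustrative only, not WAFL code): an activemap kept as a
     * flat bitmap where 1 = used and 0 = free.  Scattered frees dirty many more
     * 4 KiB activemap blocks, and therefore lengthen a CP, than the same number
     * of sequential frees. */
    #include <stdio.h>
    #include <string.h>

    #define BLOCKS            (1u << 20)     /* 1 Mi file system blocks                   */
    #define BITS_PER_AM_BLOCK (4096u * 8u)   /* bits covered by one 4 KiB activemap block */
    #define AM_BLOCKS         (BLOCKS / BITS_PER_AM_BLOCK)

    static unsigned char activemap[BLOCKS / 8];
    static unsigned char dirty[AM_BLOCKS];   /* activemap blocks that must be rewritten */

    static void free_block(unsigned blkno)
    {
        activemap[blkno / 8] &= (unsigned char)~(1u << (blkno % 8)); /* used(1) -> free(0)  */
        dirty[blkno / BITS_PER_AM_BLOCK] = 1;                        /* dirties its AM block */
    }

    static unsigned count_dirty(void)
    {
        unsigned n = 0;
        for (unsigned i = 0; i < AM_BLOCKS; i++)
            n += dirty[i];
        return n;
    }

    int main(void)
    {
        memset(activemap, 0xff, sizeof activemap);      /* start with every block used */

        memset(dirty, 0, sizeof dirty);
        for (unsigned i = 0; i < 32768; i++)            /* 32 Ki sequential frees */
            free_block(i);
        printf("sequential frees dirtied %u activemap blocks\n", count_dirty());

        memset(dirty, 0, sizeof dirty);
        for (unsigned i = 0; i < 32768; i++)            /* 32 Ki scattered frees */
            free_block((i * 40009u) % BLOCKS);
        printf("scattered frees dirtied %u activemap blocks\n", count_dirty());
        return 0;
    }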

Append Log using the L1s of a File
- Delay updates of the activemap to avoid the chaining problem; convert many random updates into a few batched ones (see the sketch below)
- TLog ('08): binary sort tree using the L1s of a log file; dropped due to unpredictable performance
- BFLog ('12): 3 files - active, inactive (sorting), sorted; predictably paces sorting and freeing activity
- A log sized to ~0.5% of the file system provides sufficient batching
- Handles blocks freed by file/LUN deletions and SCSI hole-punching
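
A rough sketch of the batching idea behind a free log (the function and constant names are invented; the real BFLog keeps its state in files and paces the work): accumulate freed block numbers, sort the batch by block number, then clear activemap bits in order so each activemap block is loaded and dirtied at most once.

    /* Minimal sketch (assumed design, not WAFL source): batch frees in an
     * append-only log, then sort by block number before clearing activemap
     * bits, so each activemap block is touched at most once per batch. */
    #include <stdio.h>
    #include <stdlib.h>

    #define BITS_PER_AM_BLOCK 32768u   /* blocks covered by one 4 KiB activemap block */

    static int cmp_u32(const void *a, const void *b)
    {
        unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
        return (x > y) - (x < y);
    }

    /* Apply a batch of logged frees; returns how many activemap blocks were touched. */
    static unsigned apply_free_log(unsigned *freelog, unsigned n)
    {
        unsigned touched = 0, prev_am = ~0u;

        qsort(freelog, n, sizeof freelog[0], cmp_u32);   /* the "sorting" phase of the log */
        for (unsigned i = 0; i < n; i++) {
            unsigned am = freelog[i] / BITS_PER_AM_BLOCK;
            if (am != prev_am) {                         /* load + dirty this activemap block once */
                touched++;
                prev_am = am;
            }
            /* ...clear bit (freelog[i] % BITS_PER_AM_BLOCK) in activemap block 'am'... */
        }
        return touched;
    }

    int main(void)
    {
        enum { N = 100000 };
        unsigned *freelog = malloc(N * sizeof *freelog);
        for (unsigned i = 0; i < N; i++)
            freelog[i] = (i * 2654435761u) % (1u << 24); /* scattered frees across 16 Mi blocks */
        printf("sorted batch touched %u activemap blocks\n", apply_free_log(freelog, N));
        free(freelog);
        return 0;
    }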

Logging Results
[Graphs: latency (msec) vs. achieved throughput (thousands of IOPS) with BFLog off and on; achieved GiB/s for unaged vs. aged deletes with logging off and on.]
- 17% higher write throughput at 34%-48% lower latency
- 6% to 60% higher raw deletion throughput (unaged vs. aged)
- Much higher delete throughput with little interference (not shown)

Snapshots
- volume = current FS + all snapshots
[Diagram: snapshot S1 and the current file system S0 each root their own tree of blocks at a superblock, and each has its own activemap over blocks A-E. Legend: free block, used block.]

Summary Map
- A block is free iff every activemap says it is free
- Summary map = bit-OR of the snapshots' activemaps; a metafile in the current file system
- A block is free iff its active & summary bits are both 0 (see the sketch below)
- The summary can become stale (snapshot creation, deletion, etc.): a sequential scan fixes up the summary, and correctness is preserved while the scan is in progress
- Fast reclamation of space without impacting other operations
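
A compact sketch of the free test described above (the tiny sizes and all identifiers are illustrative): the summary map is rebuilt as the bit-OR of the snapshots' activemaps, and a block may be reused only if both its active and summary bits are 0.

    /* Minimal sketch: summary = OR of snapshot activemaps; a block is reusable
     * only when the active file system's bit and the summary bit are both 0. */
    #include <stdbool.h>
    #include <stdio.h>

    #define WORDS 4                       /* tiny 256-block file system for illustration */
    typedef unsigned long long bitmap_t[WORDS];

    /* Rebuild (or lazily fix up) the summary by OR-ing every snapshot's activemap. */
    static void update_summary(bitmap_t summary, bitmap_t snaps[], int nsnaps)
    {
        for (int w = 0; w < WORDS; w++) {
            summary[w] = 0;
            for (int s = 0; s < nsnaps; s++)
                summary[w] |= snaps[s][w];
        }
    }

    static bool block_is_free(const bitmap_t active, const bitmap_t summary, unsigned blkno)
    {
        unsigned long long mask = 1ULL << (blkno % 64);
        return !(active[blkno / 64] & mask) && !(summary[blkno / 64] & mask);
    }

    int main(void)
    {
        bitmap_t active   = {0};
        bitmap_t snaps[2] = {{0}, {0}};
        bitmap_t summary;

        snaps[0][0] = 1ULL << 5;          /* block 5 is still referenced by snapshot S1 */
        update_summary(summary, snaps, 2);

        printf("block 5 free? %s\n", block_is_free(active, summary, 5) ? "yes" : "no"); /* no  */
        printf("block 6 free? %s\n", block_is_free(active, summary, 6) ? "yes" : "no"); /* yes */
        return 0;
    }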

WAFL Layering: FlexVol & Aggregates
[Diagram: a FlexVolume with its own superblock and activemap (blocks A-E) sits inside a container file of the aggregate, which has its own superblock and activemap over physical blocks M-W; each FlexVol block maps to a block of the aggregate.]
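
As a simplified illustration of the layering (the block numbers and names are made up), a FlexVol virtual block maps through the volume's container file to a physical aggregate block, so freeing a FlexVol block eventually punches the corresponding hole in the container and clears the aggregate's activemap bit.

    /* Minimal sketch (illustrative only): a FlexVol virtual block number maps
     * through the volume's container file to a physical aggregate block. */
    #include <stdio.h>

    #define VOL_BLOCKS 8
    #define NO_BLOCK   (-1)

    static int container_map[VOL_BLOCKS] = { 10, 11, 3, NO_BLOCK, 7, 2, NO_BLOCK, 9 };
    static unsigned char aggr_used[16]   = { [2]=1, [3]=1, [7]=1, [9]=1, [10]=1, [11]=1 };

    static void free_flexvol_block(int vvbn)
    {
        int pvbn = container_map[vvbn];
        if (pvbn == NO_BLOCK)
            return;                       /* already a hole in the container file */
        container_map[vvbn] = NO_BLOCK;   /* punch the hole (delayed in practice) */
        aggr_used[pvbn] = 0;              /* clear the aggregate's activemap bit  */
    }

    int main(void)
    {
        free_flexvol_block(4);            /* frees aggregate block 7 as well */
        printf("aggr block 7 used? %d\n", aggr_used[7]);
        return 0;
    }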

Bunched Delayed Frees & Logging Interaction
- Bunched delayed frees: 60% lower latency at any load & a higher saturation point
- Optimized processing via the FlexVol log: 84% of the container blocks punched out in optimized mode -> 26% higher throughput

Other Topics in WAFL

WAFL Block Allocation
- WAFL & ONTAP configurations: all-HDD; Flash Pool (SSD+HDD); all-SSD; Cloud ONTAP (AWS/Azure); ONTAP Select (software-defined); FabricPool (SSD + object store, on-prem or AWS/Azure)
- Allocation policies based on temperature: client access patterns, policies, snapshots, replicas, etc.
- Flexible block allocator: incorporate new media; test & productize new policies quickly; MP: scale up with the number of cores
- ICPP 2017: Scalable Write Allocation in the WAFL File System
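
One way to picture a flexible, media- and temperature-aware allocator is a small policy table consulted at allocation time; the sketch below is purely hypothetical (none of these policies or names come from WAFL) and only shows why a table-driven design makes new media and new policies easy to add and test.

    /* Hypothetical sketch of a pluggable allocation policy chosen by media type
     * and data temperature; all of these names are invented. */
    #include <stdio.h>

    enum media { MEDIA_HDD, MEDIA_SSD, MEDIA_OBJECT_STORE };
    enum temp  { TEMP_HOT, TEMP_COLD };

    struct policy {
        const char *name;
        unsigned  (*pick_region)(unsigned hint);   /* returns a region of blocks to fill */
    };

    static unsigned pick_sequential(unsigned hint) { return hint; }        /* fill long RAID stripes    */
    static unsigned pick_cold_tier(unsigned hint)  { return hint + 1000; } /* push to the capacity tier */

    /* The allocator consults a small table, so new media or new policies can be
     * added and tested without touching the callers. */
    static struct policy *choose_policy(enum media m, enum temp t)
    {
        static struct policy seq  = { "sequential-stripe", pick_sequential };
        static struct policy cold = { "cold-tier",         pick_cold_tier  };

        if (m == MEDIA_OBJECT_STORE || t == TEMP_COLD)
            return &cold;
        return &seq;
    }

    int main(void)
    {
        struct policy *p = choose_policy(MEDIA_SSD, TEMP_HOT);
        printf("using policy %s, region %u\n", p->name, p->pick_region(0));
        return 0;
    }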

Scaling with CPU Cores
- ONTAP was originally single-threaded, then 2 WAFL threads: client ops vs. CP/internal work
- WAFL MP model: needed incremental MP-ization of existing code; buffers/inodes: fine-grained locking vs. data/metadata partitions; frequent ops first (read, write, lookup, readdir, etc.)
- Hierarchical data partitioning: all ops, user & internal; WAFL thread creation/exit is dynamic (see the sketch below)
- OSDI 2016: To Waffinity and Beyond: A Scalable Architecture for Incremental Parallelization of File System Code
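
The hierarchical partitioning idea can be sketched as choosing the least exclusive partition that keeps an operation safe without explicit locks; the levels and fields below are illustrative, not the actual Waffinity hierarchy.

    /* Rough sketch (not the Waffinity implementation) of hierarchical data
     * partitioning: an op that touches only one file's stripe runs in that
     * fine-grained partition; an op that spans files runs in the coarser
     * volume-level partition, which excludes all of its children. */
    #include <stdio.h>

    enum level { PARTITION_STRIPE, PARTITION_VOLUME, PARTITION_SERIAL };

    struct op {
        const char *name;
        int         touches_one_stripe;   /* e.g., read/write within one file region */
        int         touches_one_volume;   /* e.g., an op over one volume's metadata  */
    };

    /* Pick the least exclusive partition that still makes the op safe:
     * finer partitions allow more concurrency across cores. */
    static enum level choose_partition(const struct op *o)
    {
        if (o->touches_one_stripe)
            return PARTITION_STRIPE;
        if (o->touches_one_volume)
            return PARTITION_VOLUME;
        return PARTITION_SERIAL;          /* rare ops fall back to full serialization */
    }

    int main(void)
    {
        struct op read_op = { "read",        1, 1 };
        struct op setattr = { "vol-setattr", 0, 1 };
        printf("%s -> level %d\n", read_op.name, choose_partition(&read_op));
        printf("%s -> level %d\n", setattr.name, choose_partition(&setattr));
        return 0;
    }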

WAFL Buffer Cache
- Mostly-LRU buffer cache: user data vs. metadata; pinning, tracking, etc. (see the sketch below)
- Speculative readahead; tricks to read from SSD, HDD, etc.
- MP access: insert, read, write, scavenge
- ICPP 2016: Think Global, Act Local: A Buffer Cache Design for Global Ordering and Parallel Processing in the WAFL File System
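
A toy mostly-LRU cache with a crude scavenging preference follows; all names and the eviction rule are simplifications, not the design from the ICPP paper.

    /* Minimal sketch (not the WAFL design): a tiny mostly-LRU buffer cache that
     * evicts the least recently used buffer, with a crude priority so metadata
     * buffers outlive user-data buffers of the same age. */
    #include <stdio.h>

    #define NBUFS 4

    struct buf {
        int      blkno;        /* -1 means the slot is empty            */
        int      is_metadata;  /* metadata is scavenged after user data */
        unsigned last_used;    /* logical clock for LRU ordering        */
    };

    static struct buf cache[NBUFS];
    static unsigned clock_now;

    static struct buf *lookup_or_load(int blkno, int is_metadata)
    {
        int victim = 0;

        for (int i = 0; i < NBUFS; i++)
            if (cache[i].blkno == blkno) {          /* cache hit: just touch it */
                cache[i].last_used = ++clock_now;
                return &cache[i];
            }

        for (int i = 1; i < NBUFS; i++) {           /* miss: pick a scavenge victim */
            int better = cache[i].is_metadata < cache[victim].is_metadata ||
                         (cache[i].is_metadata == cache[victim].is_metadata &&
                          cache[i].last_used < cache[victim].last_used);
            if (better)
                victim = i;
        }
        /* ...read blkno from SSD/HDD (and kick off speculative readahead)... */
        cache[victim] = (struct buf){ blkno, is_metadata, ++clock_now };
        return &cache[victim];
    }

    int main(void)
    {
        for (int i = 0; i < NBUFS; i++)
            cache[i].blkno = -1;
        lookup_or_load(100, 1);
        lookup_or_load(200, 0);
        lookup_or_load(100, 1);
        printf("slot 0 holds block %d\n", cache[0].blkno);
        return 0;
    }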

WAFL Resiliency
- Storage devices misbehave: media errors, failures, lost/redirected writes; handled by RAID-DP and checksum + context
- Software/hardware bugs: scribbles vs. file system logic bugs
- Prevent corruption from getting persisted: use the transactional nature of WAFL; verify delta equations (see the sketch below); and do it with negligible performance impact
- FAST 2017: High Performance Metadata Integrity Protection in the WAFL Copy-on-Write File System
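
As an illustration of checking a delta equation at CP commit (the specific invariant and names below are assumed for the example, not taken from the FAST paper): the change in the used-block count must match the allocations minus the frees performed during the CP, and a mismatch fails the CP before the corruption can reach disk.

    /* Minimal sketch of verifying a "delta equation" before a consistency point
     * commits: the change in the used-block count must equal allocations minus
     * frees made during the CP.  A mismatch indicates a scribble or logic bug. */
    #include <stdio.h>
    #include <stdlib.h>

    struct cp_stats {
        long used_before;   /* used blocks recorded at the start of the CP    */
        long used_after;    /* used blocks about to be written to disk        */
        long allocated;     /* blocks the allocator handed out during the CP  */
        long freed;         /* blocks returned to the activemap during the CP */
    };

    static void verify_cp_deltas(const struct cp_stats *s)
    {
        if (s->used_after - s->used_before != s->allocated - s->freed) {
            fprintf(stderr, "delta check failed: refusing to commit CP\n");
            abort();        /* in practice: fail the CP and raise an error */
        }
    }

    int main(void)
    {
        struct cp_stats ok = { 1000, 1040, 100, 60 };
        verify_cp_deltas(&ok);
        printf("delta check passed; CP can commit\n");
        return 0;
    }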

Other Topics
- Scale-out across an ONTAP cluster: FlexGroups
- Recovery: WAFL Iron for online repair
- Data management topics: storage efficiency (deduplication, compression, compaction); replication, retention, data mobility; instantaneous cloning at file & file system granularity; software encryption; revert & non-disruptive upgrade
- Upcoming projects: new media, Plexistor integration, etc.

Conclusion
- WAFL has evolved over 2+ decades of deployments, media/hardware trends, and applications/workloads
- Fundamental designs: some stood the test of time, some changed with requirements/trends
- Consistent & predictable performance, with low latency and high throughput
- Fun & difficult problems: some have been solved; several new ones remain