HydraFS: a High-Throughput File System for the HYDRAstor Content-Addressable Storage System

HydraFS: a High-Throughput File System for the HYDRAstor Content-Addressable Storage System Cristian Ungureanu, Benjamin Atkin, Akshat Aranya, Salil Gokhale, Steve Rago, Grzegorz Calkowski, Cezary Dubnicki, Aniruddha Bohra Feb 26, 2010

HYDRAstor: De-duplicated Scalable Storage
- Scale-out storage with global de-duplication using Content-Defined Chunking
- Resilient to multiple failures
- Easy to manage (self-healing, ...)
- High throughput for streaming access
- Std. interfaces (NFS/CIFS, VTL, ...)
Two layers: an Access Layer (standard protocols, chunking, high throughput; FAST '10: "HydraFS: a High Throughput Filesystem", "Bimodal CDC for Backup Streams") sitting on a content-addressable API exported by the Content-addressable Block Store (scalable, easy to manage, resilient, high throughput; FAST '09: "HYDRAstor: a Scalable Secondary Storage"). 2

HYDRAstor Usage Example Block Store (CAS) API Variable-size blocks B1 File System Block Store 3

HYDRAstor Usage Example Block Store (CAS) API File System Variable-size blocks Content-addressable Address decided by the store CA 1 B1 Block Store 4

HYDRAstor Usage Example B2 File System Block Store (CAS) API Variable-size blocks Content-addressable Address decided by the store CA 1 B1 Block Store 5

HYDRAstor Usage Example Block Store (CAS) API File System Variable-size blocks Content-addressable Address decided by the store CA 1 CA 2 B1 B2 Block Store 6

HYDRAstor Usage Example B3 File System Block Store (CAS) API Variable-size blocks Content-addressable Address decided by the store CA 1 CA 2 B1 B2 Block Store 7

HYDRAstor Usage Example Block Store (CAS) API File System Variable-size blocks Content-addressable Address decided by the store Duplicates eliminated by store CA 1 CA 1 CA 2 B3 B1 B2 Block Store 8

HYDRAstor Usage Example CA 1 File System Block Store (CAS) API Variable-size blocks Content-addressable Address decided by the store Duplicates eliminated by store CA 1 CA 2 B3 B1 B2 Block Store 9

HYDRAstor Usage Example Block Store (CAS) API CA 1 CA 2 File System Variable-size blocks Content-addressable Address decided by the store Duplicates eliminated by store CA 1 B3 B1 B2 Block Store 10

HYDRAstor Usage Example Block Store (CAS) API CA 1 CA 2 CA 1 File System Variable-size blocks Content-addressable Address decided by the store Duplicates eliminated by store B1 B2 Block Store 11

HYDRAstor Usage Example B4 CA 1 CA 2 CA 1 File System Block Store (CAS) API Variable-size blocks Content-addressable Address decided by the store Duplicates eliminated by store Configurable block resilience B1 B2 Block Store 12

HYDRAstor Usage Example Block Store (CAS) API File System Variable-size blocks Content-addressable Address decided by the store Duplicates eliminated by store Configurable block resilience CA 3 B4 CA 1 CA 2 CA 1 B1 B2 Block Store 13

HYDRAstor Usage Example CA 3 File System Block Store (CAS) API Variable-size blocks Content-addressable Address decided by the store Duplicates eliminated by store Configurable block resilience B4 CA 1 CA 2 CA 1 B1 B2 Block Store 14

HYDRAstor Usage Example Block Store (CAS) API Root 1 CA 3 File System Variable-size blocks Content-addressable Address decided by the store Duplicates eliminated by store Configurable block resilience Garbage collection B4 CA 1 CA 2 CA 1 B1 B2 Block Store 15

HYDRAstor Usage Example Block Store (CAS) API Root 1 CA 3 File System Variable-size blocks Content-addressable Address decided by the store Duplicates eliminated by store Configurable block resilience Garbage collection B4 CA 1 CA 2 CA 1 B1 B2 Block Store 16
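The walkthrough above can be summarized as a small programmatic interface: the store picks the address (a hash of the block's content), a duplicate write simply returns the existing address, retained roots anchor what must survive, and everything unreachable from a root is garbage-collected. The sketch below is an illustrative, in-memory model of such a content-addressable store, not the HYDRAstor API itself; the class and method names, and the use of SHA-1, are assumptions made for the example.

```python
import hashlib

class ToyContentAddressableStore:
    """In-memory sketch of a CAS block store with dedup and GC (illustrative only)."""

    def __init__(self):
        self.blocks = {}    # content address -> (data, tuple of child addresses)
        self.roots = set()  # retained roots: GC keeps everything reachable from these

    def write_block(self, data, pointers=()):
        # The store, not the client, decides the address: it is a hash of the content.
        ca = hashlib.sha1(data + b"".join(p.encode() for p in pointers)).hexdigest()
        # Writing a duplicate block simply returns the existing address.
        self.blocks.setdefault(ca, (data, tuple(pointers)))
        return ca

    def read_block(self, ca):
        return self.blocks[ca]

    def retain_root(self, ca):
        self.roots.add(ca)

    def garbage_collect(self):
        # Drop every block not reachable from a retained root.
        live, stack = set(), list(self.roots)
        while stack:
            ca = stack.pop()
            if ca in live:
                continue
            live.add(ca)
            stack.extend(self.blocks[ca][1])
        self.blocks = {ca: blk for ca, blk in self.blocks.items() if ca in live}


store = ToyContentAddressableStore()
ca1 = store.write_block(b"B1")
ca2 = store.write_block(b"B2")
assert store.write_block(b"B1") == ca1            # duplicate eliminated by the store
root = store.write_block(b"", pointers=(ca1, ca2))
store.retain_root(root)
store.garbage_collect()                           # B1 and B2 survive; unreferenced blocks would not
```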

Outline HYDRAstor content-addressable API Challenges posed to the filesystem Filesystem architecture Techniques used to overcome the challenges Conclusions and future work 17

Challenges
Content-addressable blocks
- A change in a block's contents also changes the block's address
- All metadata has to change, recursively up to the filesystem root
- A parent can only be written after its children's writes have succeeded
Variable-sized chunking (splitting file data into blocks; sketched below)
- Block boundaries change when content is changed
- Overwrites cause read-rechunk-rewrite
High-latency block store operations
- Why? Hashing, compression, erasure coding, fragment distribution
- Exacerbates the above two challenges
18
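"Variable-sized chunking" refers to content-defined chunking: a boundary is placed wherever a rolling hash over the last few dozen bytes matches a pattern, so an edit only disturbs the chunks around it. The sketch below uses a generic windowed rolling hash purely for illustration; the parameters and the hash are assumptions, not the HydraFS chunker (or its bimodal variant).

```python
from collections import deque

def content_defined_chunks(data, window=48, mask=(1 << 13) - 1,
                           min_size=2048, max_size=65536,
                           base=257, mod=(1 << 61) - 1):
    """Split `data` where a rolling hash over the last `window` bytes hits a
    boundary pattern (expected chunk size around 8 KB for a 13-bit mask),
    bounded by min_size and max_size. Illustrative parameters only."""
    chunks, start = [], 0
    win, h = deque(), 0
    pow_out = pow(base, window - 1, mod)       # weight of the byte leaving the window
    for i, byte in enumerate(data):
        win.append(byte)
        h = (h * base + byte) % mod
        if len(win) > window:
            h = (h - win.popleft() * pow_out * base) % mod
        size = i - start + 1
        boundary = size >= min_size and (h & mask) == 0
        if boundary or size >= max_size or i == len(data) - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    return chunks
```

Because a boundary depends only on the bytes inside the hash window, editing one region of a file shifts at most the chunks around that region; everything else keeps its content and therefore its content address, which is what lets modified backup streams deduplicate well.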

Persistent Layout
- Filesystem superblock (root block) → inode map root
- Inode map root → inode map B-tree → inode map (segmented array)
- Inode map entries point to file inodes and directory inodes
- File inode → inode B-tree → file contents
- Directory inode → directory B-tree → directory contents
19
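Because every persistent structure is itself a content-addressed block, the layout above is a tree of content addresses rooted at the superblock. The sketch below models that nesting with plain dictionaries and SHA-1 addresses; the field names (blocks, entries, root, timestamp) are illustrative assumptions, not the HydraFS on-disk format.

```python
import hashlib, json

def put(store, obj):
    """Write a JSON-serializable object as a CAS block and return its address."""
    data = json.dumps(obj, sort_keys=True).encode()
    ca = hashlib.sha1(data).hexdigest()
    store[ca] = data
    return ca

store = {}

# File contents live in data blocks; the file inode is a tree of their addresses.
d1 = put(store, {"data": "chunk 1 of file"})
d2 = put(store, {"data": "chunk 2 of file"})
file_inode = put(store, {"type": "file", "size": 30, "blocks": [d1, d2]})

# Directory contents name files by inode number; the inode map (reachable from
# its own root) maps inode numbers to inode content addresses.
dir_inode  = put(store, {"type": "dir", "entries": {"report.txt": 2}})
inode_map  = put(store, {1: dir_inode, 2: file_inode})
superblock = put(store, {"root": inode_map, "timestamp": 20})

# Changing any block changes its CA, so every ancestor up to the superblock
# must be rewritten: the recursive metadata-update challenge listed above.
```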

HydraFS Architecture
- File Server: handles user operations, writes data and an update log of timestamped records (TS=3; TS=2; TS=1; op 1, op 2, ...) to the Block Store, and exchanges control messages with the Commit Server
- Commit Server: consumes the update log and publishes new File System Roots (e.g., TS=1, TS=20) in the Block Store
- Block Store: holds metadata, data, and the update log
20

File Server Write buffer Accumulates written data; flushed on sync Helps re-order NFS packets arriving out-of-order Write Buffer (dirty data) 21

File Server Write buffer Accumulates written data; flushed on sync Helps re-order NFS packets arriving out-of-order Chunker Decides block boundaries (based on data content) Write Buffer (dirty data) Chunker 22

File Server Write buffer Accumulates written data; flushed on sync Helps re-order NFS packets arriving out-of-order Chunker Decides block boundaries (based on data content) Metadata modification records (file, directory, inode map) Dirty metadata annotated with time-stamp (for cleaning) Written out to log Large amount of dirty metadata! Requires efficient cleaning Resource management issues Write Buffer (dirty data) Chunker Metadata Modification Records File offset_range CA Directory additions/removals Inode map de/allocations (dirty metadata) 23

File Server
- Write Buffer (dirty data): accumulates written data, flushed on sync; helps re-order NFS packets arriving out-of-order
- Chunker: decides block boundaries (based on data content)
- Metadata Modification Records (dirty metadata): file offset_range → CA, directory additions/removals, inode map de/allocations; dirty metadata is annotated with a time-stamp (for cleaning) and written out to the log
- Block Cache: clean data and metadata blocks, keyed by CA (not de-serialized)
24

Write Processing Write Buffer Metadata Modification Records Block Cache Chunker Block Store 25

Write Processing [0,8 KB) Write Buffer Metadata Modification Records Block Cache Chunker Block Store 26

Write Processing Write Buffer Metadata Modification Records Block Cache [0, 8 KB) Chunker Block Store 27

Write Processing [8 KB,16 KB) Write Buffer Metadata Modification Records Block Cache [0, 8 KB) Chunker Block Store 28

Write Processing Write Buffer [8 KB,16 KB) [0, 8 KB) Chunker Metadata Modification Records Block Cache Block Store 29

Write Processing Write Buffer Metadata Modification Records Block Cache [12 KB, 16 KB) Chunker 12 KB of data Block Store 30

Write Processing Write Buffer Metadata Modification Records Block Cache [12 KB, 16 KB) Chunker CA 1 Data blocks 12 KB of data Block Store 31

Write Processing Write Buffer Metadata Modification Records Block Cache [12 KB, 16 KB) Chunker TS, [0, 12KB) CA 1 CA 1 Data blocks 12 KB of data Block Store 32

Write Processing [16 KB,24 KB) Write Buffer Metadata Modification Records Block Cache [12 KB, 16 KB) Chunker TS, [0, 12KB) CA 1 Data blocks 12 KB of data Block Store 33

Write Processing Write Buffer [16 KB,24 KB) [12 KB, 16 KB) Chunker Metadata Modification Records TS, [0, 12KB) CA 1 Block Cache Data blocks 12 KB of data Block Store 34

Write Processing Write Buffer Metadata Modification Records Block Cache [22 KB, 24 KB) Chunker TS, [12KB, 22KB) CA 2 TS, [0, 12KB) CA 1 Data blocks 12 KB of data 10 KB of data Block Store 35

Write Processing Write Buffer Metadata Modification Records Block Cache [22 KB, 24 KB) Chunker TS, [12KB, 22KB) CA 2 TS, [0, 12KB) CA 1 TS ; [0, 12KB) CA 1 ; [12KB, 22KB) CA 2 ; Data blocks 12 KB of data 10 KB of data Log block Block Store 36
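Putting the frames above together: writes accumulate in the write buffer, the chunker emits completed chunks as data blocks to the block store, each returned content address becomes a metadata modification record tagged with the current timestamp, and batches of records are flushed as log blocks. The sketch below is a simplified single-file version of that pipeline under stated assumptions (in-order 8 KB writes, a fixed 12 KB chunk size standing in for content-defined boundaries, SHA-1 standing in for the store's addresses); it shows the data flow, not HydraFS internals.

```python
import hashlib

class WritePath:
    """Simplified sketch of a file-server write path for one file."""

    def __init__(self, chunk_size=12 * 1024):
        self.chunk_size = chunk_size     # stand-in for content-defined boundaries
        self.buffer = b""                # write buffer (dirty data)
        self.buffered_from = 0           # file offset of the first buffered byte
        self.records = []                # metadata modification records
        self.log_blocks = []             # serialized batches of records
        self.timestamp = 1

    def write(self, offset, data):
        # The real write buffer also reorders out-of-order NFS packets;
        # this sketch assumes in-order appends.
        assert offset == self.buffered_from + len(self.buffer)
        self.buffer += data
        self._chunk()

    def _chunk(self):
        while len(self.buffer) >= self.chunk_size:
            chunk, self.buffer = self.buffer[:self.chunk_size], self.buffer[self.chunk_size:]
            ca = hashlib.sha1(chunk).hexdigest()        # the block store returns the CA
            lo = self.buffered_from
            self.buffered_from = lo + len(chunk)
            self.records.append((self.timestamp, (lo, lo + len(chunk)), ca))

    def sync(self):
        if self.buffer:                                 # force out the partial tail chunk
            tail, self.buffer = self.buffer, b""
            ca = hashlib.sha1(tail).hexdigest()
            lo = self.buffered_from
            self.buffered_from = lo + len(tail)
            self.records.append((self.timestamp, (lo, lo + len(tail)), ca))
        # Flush the accumulated metadata modification records as one log block.
        self.log_blocks.append((self.timestamp, list(self.records)))
        self.records.clear()
        self.timestamp += 1


wp = WritePath()
for off in range(0, 32 * 1024, 8 * 1024):
    wp.write(off, b"x" * 8 * 1024)      # [0,8K), [8K,16K), ... as in the slides
wp.sync()
```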

Commit Server Filesystem superblock TS Inode map B-tree File inode Commit server does not read data 37

Commit Server Filesystem superblock TS Inode map B-tree File inode TS ; inode=2,[24KB, 32KB)=CA 1 TS ; inode=9,[24KB, 32KB)=CA 2 Log records 38

Commit Server: amortize updates over many log records; recovery time == the time to re-apply the log. 39
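One way to picture the commit server's job: it reads log records (never the data blocks themselves), applies a large batch of them to the affected inode trees, and only then writes the modified metadata and a new superblock, so the cost of the recursive CAS updates is amortized over many records. The sketch below is an illustrative reduction of that batching idea; the record and superblock formats are assumptions.

```python
def commit(log_records, inode_trees, superblock_ts):
    """Apply a batch of log records, then publish one new superblock.

    log_records: list of (timestamp, inode_number, (lo, hi), content_address)
    inode_trees: dict inode_number -> {(lo, hi): content_address}
    Many records, one superblock write: the cost of rewriting the metadata
    path up to the root is amortized over the whole batch.
    """
    for ts, ino, extent, ca in sorted(log_records):      # replay in timestamp order
        inode_trees.setdefault(ino, {})[extent] = ca
    new_ts = max((ts for ts, *_ in log_records), default=superblock_ts)
    return {"timestamp": new_ts, "inode_roots": dict(inode_trees)}


log = [
    (41, 2, (24 * 1024, 32 * 1024), "CA1"),
    (41, 9, (24 * 1024, 32 * 1024), "CA2"),
    (42, 2, (32 * 1024, 40 * 1024), "CA3"),
]
superblock = commit(log, inode_trees={}, superblock_ts=40)
# Crash recovery re-applies any log records newer than superblock["timestamp"].
```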

Metadata Cleaning File Modification Records inode #1; root= CA 101 ; min=791; max=791 791, [8KB, 18KB) CA 1 791, [18KB, 31KB) CA 2 inode #2; root= CA 122 ; min=801; max=803 801, [8KB, 18KB) CA 3 779 CA 101 Block Cache CA 122 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 40

Metadata Cleaning update( TimeStamp=802) File Modification Records inode #1; root= CA 101 ; min=791; max=791 791, [8KB, 18KB) CA 1 791, [18KB, 31KB) CA 2 inode #2; root= CA 122 ; min=801; max=803 801, [8KB, 18KB) CA 3 779 CA 101 Block Cache CA 122 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 41

Metadata Cleaning Read new, evict old superblock update( TimeStamp=802) File Modification Records inode #1; root= CA 101 ; min=791; max=791 779 802 Block Cache 791, [8KB, 18KB) CA 1 791, [18KB, 31KB) CA 2 inode #2; root= CA 122 ; min=801; max=803 801, [8KB, 18KB) CA 3 CA 101 CA 122 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 42

Metadata Cleaning Process dirty inodes one by one update( TimeStamp=802) File Modification Records inode #1; root= CA 101 ; min=791; max=791 791, [8KB, 18KB) CA 1 791, [18KB, 31KB) CA 2 802 Block Cache inode #2; root= CA 122 ; min=801; max=803 801, [8KB, 18KB) CA 3 CA 101 CA 122 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 43

Metadata Cleaning Case 1: 802 ≥ max → evict entire inode update( TimeStamp=802) File Modification Records inode #1; root= CA 101 ; min=791; max=791 791, [8KB, 18KB) CA 1 791, [18KB, 31KB) CA 2 802 Block Cache inode #2; root= CA 122 ; min=801; max=803 801, [8KB, 18KB) CA 3 CA 101 CA 122 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 44

Metadata Cleaning Case 2: 802 ≥ min and 802 < max → drop root CA, drop records with time_stamp ≤ 802 update( TimeStamp=802) File Modification Records 802 Block Cache inode #2; root= CA 122 ; min=801; max=803 801, [8KB, 18KB) CA 3 CA 101 CA 122 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 45

Metadata Cleaning Case 3: 802 < min skip record processing (all are newer) inode root remains unchanged update( TimeStamp=802) File Modification Records 802 Block Cache inode #2; root=? ; min=801; max=803 CA 101 CA 122 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 46

Metadata Cleaning Locks only one inode at a time (no tree locking) No I/O done with the lock held File Modification Records 802 Block Cache inode #2; root=? ; min=801; max=803 CA 101 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 47
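The three cases compare the new superblock's timestamp (802 in the example) with each dirty inode's [min, max] range of record timestamps. A minimal sketch of that per-inode decision, assuming the simplified record layout shown on the slides (the dictionary shape is an assumption for illustration):

```python
def clean_inode(inode, committed_ts):
    """Clean one dirty inode after a superblock update with timestamp committed_ts.

    inode: {"root": CA or None, "records": [(timestamp, extent, CA), ...]}
    Case 1: committed_ts >= max -> everything is persistent, evict the whole inode.
    Case 2: min <= committed_ts < max -> drop the stale cached root and the records
            with timestamp <= committed_ts; newer records stay dirty.
    Case 3: committed_ts < min -> all records are newer than the commit; keep everything.
    Only this inode is locked, and no I/O is done while the lock is held.
    """
    records = inode["records"]
    lo, hi = min(ts for ts, *_ in records), max(ts for ts, *_ in records)
    if committed_ts >= hi:                                   # case 1
        return None                                          # evict entire inode
    if committed_ts >= lo:                                   # case 2
        inode["root"] = None                                 # stale root CA is dropped
        inode["records"] = [r for r in records if r[0] > committed_ts]
    return inode                                             # case 3: unchanged


dirty = {"root": "CA122", "records": [(801, (8, 18), "CA3"), (803, (18, 29), "CA4")]}
assert clean_inode(dirty, committed_ts=802)["records"] == [(803, (18, 29), "CA4")]
```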

Read Processing read( inode=2, off=0, len=8kb) File Modification Records 802 Block Cache inode #2; root=? ; min=801; max=803 CA 101 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 48

Read Processing read( inode=2, off=0, len=8kb) File Modification Records 802 Block Cache inode #2; root=? ; min=801; max=803 CA 101 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 49

Read Processing read( inode=2, off=0, len=8kb) File Modification Records 802 Block Cache inode #2; root=ca 412 ; min=801; max=803 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 CA 101 CA 412 CA 122 50

Read Processing read( inode=2, off=0, len=8kb) File Modification Records 802 Block Cache inode #2; root=ca 412 ; min=801; max=803 803, [18KB, 29KB) CA 4 inode #3; root= CA 303 ; min=805; max=806 805, [0KB, 10KB) CA 1 806, [10KB, 21KB) CA 4 CA 101 CA 412 CA 122 51

Read Performance
Just pre-fetch from the block cache? Problems:
- High latency forces a large read-ahead
- Poor cache locality for metadata
Solutions:
- Separate data and metadata pre-fetch
- Weighted-LRU policy for the block cache
52

Data and Metadata Pre-fetch
Problem: the time to pre-fetch a data block varies with its position in the B-tree.
Compare reading B1 then B2 with reading B4 then B5 (same file-offset distance):
- B1: B15 → B11 → B9 → B1, then B2: B15 → B11 → B9 → B2 (interior nodes already resident)
- B4: B15 → B11 → B10 → B4, then B5: B15 → B14 → B12 → B5 (likely cache miss on the new interior nodes)
Solution: pre-fetch metadata more aggressively than data.
53
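The point of the asymmetry is that a miss on an interior (metadata) node stalls the reads of every leaf below it, so interior nodes are fetched further ahead of the current offset than data leaves. The sketch below illustrates that with a toy two-level tree; the prefetch distances and fan-out are made-up parameters, not HydraFS's.

```python
def prefetch_plan(current_offset, leaves_per_node=3, leaf_size=8 * 1024,
                  data_ahead=2, metadata_ahead=8):
    """Return (data_leaves, interior_nodes) to prefetch past current_offset.

    Metadata is fetched more aggressively (metadata_ahead leaves ahead) than
    data (data_ahead leaves ahead), so the interior node covering an upcoming
    leaf is usually resident before the leaf itself is requested.
    """
    leaf = current_offset // leaf_size
    data_leaves = list(range(leaf + 1, leaf + 1 + data_ahead))
    covered_leaves = range(leaf + 1, leaf + 1 + metadata_ahead)
    interior_nodes = sorted({l // leaves_per_node for l in covered_leaves})
    return data_leaves, interior_nodes


print(prefetch_plan(0))   # data leaves [1, 2]; interior nodes [0, 1, 2]
```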

Weighted LRU Policy for Block Cache
Problem: data and metadata blocks have different access patterns
- Data blocks being read: clean pages, pinned until the read completes; looked up once, then unlikely to be needed again (for streaming workloads)
- Data blocks pre-fetched: clean pages, not pinned; should not be evicted before they are read
- Metadata blocks: looked up more than once, but with long intervals between accesses
Solution: a cache eviction policy that favors metadata blocks (a code sketch follows the animation below)
- Insert: assign a weight based on block type
- Lookup: reset to the initial weight, and make the block MRU in that bucket
- Reclaim: evict blocks with zero weight; decrease every other block's weight by 1
54

Weighted LRU (animation, slides 55-60)
- Different block weights: initial metadata block weight 3, initial data block weight 1
- Reclamation: evict 0-weight blocks, reduce all other weights by 1
[Diagram: blocks arranged in weight buckets 0-3; successive frames show blocks being inserted and looked up (resetting their weights) and a reclamation pass evicting the 0-weight data blocks, while metadata blocks such as B15 keep higher weights and survive.]
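The policy walked through above (insert with a type-dependent weight, reset the weight and move to MRU on lookup, evict zero-weight blocks and decrement the rest on reclaim) fits in a few lines. A minimal sketch, assuming initial weights of 3 for metadata and 1 for data as on the slides; it omits pinning of blocks with reads in flight.

```python
from collections import OrderedDict

class WeightedLRUCache:
    """Weighted LRU: metadata blocks start with more 'lives' than data blocks."""

    WEIGHTS = {"metadata": 3, "data": 1}

    def __init__(self):
        self.entries = OrderedDict()       # ca -> [weight, kind, block]; last = MRU

    def insert(self, ca, kind, block):
        self.entries[ca] = [self.WEIGHTS[kind], kind, block]

    def lookup(self, ca):
        entry = self.entries.get(ca)
        if entry is None:
            return None
        entry[0] = self.WEIGHTS[entry[1]]  # reset to the initial weight for its type
        self.entries.move_to_end(ca)       # and make it most recently used
        return entry[2]

    def reclaim(self):
        # Evict every zero-weight block, then give everyone else one less life.
        for ca in [ca for ca, e in self.entries.items() if e[0] == 0]:
            del self.entries[ca]
        for entry in self.entries.values():
            entry[0] -= 1


cache = WeightedLRUCache()
cache.insert("B15", "metadata", object())
cache.insert("B1", "data", object())
cache.reclaim()                            # B1 drops to weight 0, B15 to 2
cache.reclaim()                            # B1 (weight 0) is evicted; B15 survives with weight 1
assert cache.lookup("B1") is None and cache.lookup("B15") is not None
```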

Effectiveness of Read Path Optimizations
Main techniques
- Pre-fetch metadata more aggressively than data
- Weighted-LRU to evict data more aggressively than metadata
Experiment: read a large file

             Accesses   Data misses   Metadata misses   Throughput (MB/s)
Base          486,966          1577              1011               134.3
Optimized     211,632           438               945               183.2
61

Resource Management
Pre-allocated, managed: fixed-size pools of fixed-size objects (pages are 4 KB, inodes are 8 KB, log blocks are up to 128 KB, etc.)
- Pages for data and blocks
- Inodes
- Metadata modification records
Unmanaged heap: the number of objects is bounded by that of some managed object
- Log blocks
- Auxiliary objects
62

Resource Management Reserved: 0 Allocated: 0 Total: 10 Resource Event actions Admission condition: Requested + Reserved + Allocated ≤ Total (a code sketch of this rule follows the walkthrough below) 63

Resource Management Reserved: 0 Allocated: 0 Total: 10 Resource (an event's requirements are determined by its event type) Event actions Event 1 created 4 + 0 + 0 ≤ 10 64

Resource Management Reserved: 4 Allocated: 0 Total: 10 Resource Event actions Event 1 created Event 1 admitted 65

Resource Management Reserved: 4 Allocated: 4 Total: 10 Resource Event actions Event 1 created Event 1 admitted Event 1 allocates 4 66

Resource Management Reserved: 0 Allocated: 4 Total: 10 (on completion, an event's reservations are released; it may leave behind allocated resources) Resource Event actions Event 1 created Event 1 admitted Event 1 allocates 4 Event 1 completes 67

Resource Management Reserved: 0 Allocated: 4 Total: 10 Resource Event actions Event 1 created Event 1 admitted Event 1 allocates 4 Event 1 completes Event 2 created 3 + 0 + 4 ≤ 10 68

Resource Management Reserved: 3 Allocated: 4 Total: 10 Resource Event actions Event 1 created Event 1 admitted Event 1 allocates 4 Event 1 completes Event 2 created Event 2 admitted 69

Resource Management Reserved: 3 Allocated: 6 Total: 10 Resource Event actions Event 1 created Event 1 admitted Event 1 allocates 4 Event 1 completes Event 2 created Event 2 admitted Event 2 allocates 2 70

Resource Management Reserved: 3 Allocated: 6 Total: 10 Resource Event actions Event is blocked Event 2 allocates 2 Event 3 created 3 + 3 + 6 > 10 71

Resource Management Reserved: 3 Allocated: 6 Total: 10 Resource Event actions Event 2 allocates 2 Event 3 created Event 4 created Blocked behind event 3 FIFO admission 72

Resource Management Reserved: 3 Allocated: 4 Total: 10 Resource Event actions On free, admission condition re-evaluated Event 2 allocates 2 Event 3 created Event 4 created Event 2 frees 2 3 + 3 + 4 ≤ 10 73

Resource Management Reserved: 6 Allocated: 4 Total: 10 Resource Event actions Event 2 allocates 2 Event 3 created Event 4 created Event 2 frees 2 Event 3 admitted 3 + 6 + 4 > 10 event 4 remains blocked 74

Resource Management Reserved: 3 Allocated: 4 Total: 10 Resource Event actions On un-reserve, admission condition re-evaluated Event 2 allocates 2 Event 3 created Event 4 created Event 2 frees 2 Event 3 admitted Event 2 completes 1 + 3 + 4 ≤ 10 event 4 admitted 75
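The rule illustrated by the event sequence above is: admit an event only if Requested + Reserved + Allocated ≤ Total, keep blocked events in FIFO order, and re-evaluate the condition whenever resources are freed or a reservation is released. A small single-threaded sketch of that accounting; the class, the event names, and the numbers mirror the slide example, not the real implementation, and reservations are deliberately not reduced by allocations, matching the slide counters.

```python
from collections import deque

class ResourcePool:
    """FIFO admission control over a fixed pool (sketch of the slide's rule)."""

    def __init__(self, total):
        self.total = total
        self.reserved = 0          # sum of reservations of admitted, uncompleted events
        self.allocated = 0         # resources currently allocated
        self.waiting = deque()     # (event_name, requested) in FIFO order

    def create(self, name, requested):
        self.waiting.append((name, requested))
        return self._admit()

    def _admit(self):
        admitted = []
        while self.waiting:
            name, req = self.waiting[0]
            if req + self.reserved + self.allocated > self.total:
                break              # FIFO: a blocked head blocks everyone behind it
            self.waiting.popleft()
            self.reserved += req
            admitted.append(name)
        return admitted

    def allocate(self, amount):
        self.allocated += amount   # reservations are not reduced by allocation

    def free(self, amount):
        self.allocated -= amount
        return self._admit()       # freeing may let a blocked event in

    def complete(self, reservation):
        self.reserved -= reservation
        return self._admit()       # releasing a reservation may as well


pool = ResourcePool(total=10)
pool.create("event 1", 4)          # 4 + 0 + 0 <= 10: admitted
pool.allocate(4)
pool.complete(4)                   # reservation released; 4 units stay allocated
pool.create("event 2", 3)          # 3 + 0 + 4 <= 10: admitted
pool.allocate(2)
print(pool.create("event 3", 3))   # 3 + 3 + 6 > 10: blocked -> []
print(pool.create("event 4", 1))   # blocked behind event 3 -> []
print(pool.free(2))                # 3 + 3 + 4 <= 10: ['event 3'] admitted
print(pool.complete(3))            # 1 + 3 + 4 <= 10: ['event 4'] admitted
```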

Resource Management - Reclamation
Reclamation processing:
- First, free pages from clean cached blocks
- If that is not sufficient, initiate a flush of dirty inodes (the flush is an internal event with pre-reserved resources)
Reclamation is initiated when:
- An event is blocked
- A threshold is reached; the threshold limit depends on the resource type: metadata modification records can only be cleaned through a metadata update, so start earlier; others (pages, log blocks) can be cleaned more quickly, so start later
76

Resource Management
- Limits the amount of memory used (avoid swapping)
- Avoids handling allocation failures in the middle of event processing
- Avoids event starvation through FIFO processing
- Simple but effective (allows high utilization of resources)
[Figure: page memory (MB) reserved and allocated over time during a sequential write; curves for Total, Allocated, and Reserved.]
77

Experiments
Flow control API:
- Requests can be rejected ("system busy")
- Clients are notified when they may resume submission
Tool:
- Submits requests until busy, resumes as soon as notified
- Maximum concurrency; no parent-child structures
- Gives an upper limit of performance
[Figure: throughput (MB/s) vs. duplicate ratio (%), read and write curves, for the Hydra block store and HydraFS.]
78
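The measurement tool is a simple flow-control loop: keep submitting until the system answers "busy", then pause until notified and resume immediately. A minimal sketch of that loop; the Busy exception, the FakeStore, and the submit/wait calls are stand-ins invented for the example, not the real API.

```python
class Busy(Exception):
    """Raised by the stand-in store when it cannot accept another request."""

class FakeStore:
    """Toy stand-in for the block store's flow-control interface."""
    def __init__(self, capacity=4):
        self.capacity, self.inflight, self.completed = capacity, 0, 0

    def submit(self, req):
        if self.inflight >= self.capacity:
            raise Busy()                 # "system busy": the client must back off
        self.inflight += 1

    def wait_until_not_busy(self):
        # The real tool blocks here until the store notifies it; the toy
        # version just drains everything in flight.
        self.completed += self.inflight
        self.inflight = 0

def drive(store, requests):
    """Submit until 'busy', wait to be notified, resume at once.
    Maximum concurrency, no parent-child ordering: an upper bound on throughput."""
    pending = list(requests)
    while pending:
        try:
            store.submit(pending[0])
            pending.pop(0)
        except Busy:
            store.wait_until_not_busy()
    store.wait_until_not_busy()          # drain the tail

store = FakeStore()
drive(store, range(10))
assert store.completed == 10
```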

Conclusions and Future Work
Conclusions
- Building a filesystem for a content-addressable storage system with content-defined chunking poses interesting challenges
- A small number of techniques was sufficient to overcome them while keeping the system relatively simple and achieving high throughput
Future work
- Distribute the filesystem
- Use SSD to improve performance for metadata-intensive workloads
79

Thank you! We are hiring! http://www.nec-labs.com/jobs/ 80