Ext4 Filesystem Scaling

Jan Kára <jack@suse.cz>, SUSE Labs

Overview
- Handling of orphan inodes in ext4
- Shrinking the cache of logical-to-physical block mappings
- Cleanup of transaction checkpoint lists

Orphan Inode Handling

Orphan list
- A list of inodes used to track unlinked-but-open files and files with a truncate in progress
- In case of a crash we walk the orphan list, finish the truncates, and remove the unlinked inodes
- Necessary to avoid leaking such inodes and their space on a crash
- Added for the ext3 filesystem (around 2000)
- The list is filesystem-wide: a scalability issue for parallel truncates / unlinks

[Diagram: SB -> Inode A -> Inode B -> Inode C]
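To make the recovery pass concrete, here is a minimal user-space sketch of walking the orphan list after a crash. The structure and function names (struct orphan_inode, recover_orphan_list) are invented for illustration; only the list-through-the-inodes shape mirrors the real on-disk format, where ext4 reuses the inode's i_dtime field as the next pointer.

```c
#include <stdio.h>

/* Simplified model of the on-disk orphan list: the superblock points at
 * the first orphan inode and each inode points at the next (ext4 reuses
 * the i_dtime field for this). All names here are illustrative. */
struct orphan_inode {
    unsigned long ino;
    int nlink;                  /* 0 => file was unlinked, just free it */
    long truncate_to;           /* truncate target size if nlink > 0    */
    struct orphan_inode *next;
};

/* Crash recovery: finish whatever the crashed kernel left half-done so
 * that neither the inode nor its blocks are leaked. */
static void recover_orphan_list(struct orphan_inode *head)
{
    for (struct orphan_inode *i = head; i; i = i->next) {
        if (i->nlink == 0)
            printf("freeing unlinked inode %lu\n", i->ino);
        else
            printf("finishing truncate of inode %lu to %ld bytes\n",
                   i->ino, i->truncate_to);
    }
}

int main(void)
{
    struct orphan_inode b = { 2002, 1, 4096, NULL };  /* truncate pending */
    struct orphan_inode a = { 2001, 0, 0, &b };       /* unlinked file    */

    recover_orphan_list(&a);
    return 0;
}
```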

Orphan list contention
- Orphan stress test with 16 processes on an 8-CPU server with 4 GB of RAM, filesystem on a ramdisk
- Lockstat excerpt for the orphan lock (times in microseconds):

  wait max    wait total       wait avg   acquisitions   hold max   hold total     hold avg
  12211.31    2288111288.96    244.07     26209088       85.74      62275116.77    2.38

- Test execution took 160 s; tasks waited on the orphan mutex for 2288 s in total, i.e. 143 s per task

Orphan List Scalability Improvements
- Joint work with Thavatchai Makphaibulchoke
- List handling is inherently unscalable; unsolvable without on-disk format changes
- Still we can do something:
  - Remove unnecessary global lock acquisitions when the inode is already part of the orphan list (inode modifications themselves are guarded by an inode-local lock); see the sketch after this list
  - Make the critical section shorter:
    - Effective for insertion, where the blocks to modify are known in advance
    - Not for deletion, where the blocks to modify are known only after taking the global lock
- Merged in 3.16
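The fast path can be illustrated with the following user-space sketch using pthreads in place of kernel locks. All names (orphan_add, on_orphan_list, and so on) are made up, and the ordering and error handling of the real ext4 code are omitted.

```c
#include <pthread.h>
#include <stdbool.h>

/* Sketch of the fast path: check an inode-local flag under the inode's
 * own lock and take the filesystem-wide mutex only when the list really
 * has to change. Simplified; the real code also handles ordering and
 * error cases that are omitted here. */
struct super {
    pthread_mutex_t orphan_mutex;    /* global, heavily contended */
};

struct inode {
    pthread_mutex_t i_lock;          /* inode-local lock */
    bool on_orphan_list;
    struct super *sb;
};

static void orphan_add(struct inode *inode)
{
    pthread_mutex_lock(&inode->i_lock);
    if (inode->on_orphan_list) {     /* fast path: no global lock taken */
        pthread_mutex_unlock(&inode->i_lock);
        return;
    }
    inode->on_orphan_list = true;
    pthread_mutex_unlock(&inode->i_lock);

    pthread_mutex_lock(&inode->sb->orphan_mutex);
    /* ... link the inode into the filesystem-wide orphan list ... */
    pthread_mutex_unlock(&inode->sb->orphan_mutex);
}
```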

Testing the Effect of Changes
- Test data from a 48-way Xeon server with 32 GB RAM
- Ext4 filesystem on a ramdisk, preallocated with the NUMA interleave policy
- Average of 5 runs
- Ran the stress-orphan microbenchmark and the reaim new_fserver workload
- Reaim was modified not to call sync(1) after every read & write

[Chart: Effect of Tweaks (stress-orphan)]

[Chart: Effect of Tweaks (reaim - fserver)]

Orphan File (experimental patches)
- A preallocated system file where we store the inode numbers of orphan inodes
- When we run out of space in the orphan file (too many files deleted in parallel), we fall back to the old orphan list
- Hack: skip journalling a block if it is part of the running transaction
- Real fix: improve the JBD2 calls to have low overhead when there is nothing to do (block already in the transaction in the correct state)

[Diagram: Block 1 = Ino1 Ino2 ... InoN, Block 2 = Ino1 Ino2 ... InoN, ...]
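As a rough illustration, one block of such an orphan file could look like the sketch below. The layout and the names are hypothetical, inferred from the diagram above rather than taken from the patches.

```c
#include <stdint.h>

#define FS_BLOCK_SIZE 4096   /* assumed filesystem block size */

/* Hypothetical on-disk layout of one orphan-file block: a plain array
 * of 32-bit inode numbers, with 0 meaning "slot free". A 4 KB block
 * holds 1024 slots, and the file is a sequence of such blocks. */
struct orphan_block {
    uint32_t ino[FS_BLOCK_SIZE / sizeof(uint32_t)];
};
```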

Orphan slot allocation strategies
- Simple: search for a free slot sequentially from the beginning of the file, under a spinlock
- More complex: hash the CPU ID to a block in the orphan file (we can easily have more than one block per CPU) and start searching at that block. With atomic counters for the free entries in a block and cmpxchg() for storing the inode number in a slot, this is completely lockless (except for journalling locks in rare cases); see the sketch below
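The lockless scheme might look like this user-space sketch with C11 atomics; the slot layout, the names, and the exact reservation protocol are assumptions, not the actual patches. Reserving a slot through the per-block free counter first guarantees that the subsequent scan always finds an empty slot.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOTS_PER_BLOCK 1024      /* illustrative: 4 KB of 32-bit slots */

/* One orphan-file block with a free-slot counter. This is a user-space
 * model of the strategy on this slide, not the kernel patches. */
struct orphan_block {
    atomic_int free_slots;
    _Atomic uint32_t slot[SLOTS_PER_BLOCK];    /* 0 == free */
};

/* Try to claim a slot: first reserve one by decrementing the free
 * counter, then race for an empty slot with compare-and-swap. The
 * reservation bounds the number of claimants, so the scan terminates. */
static bool orphan_claim_slot(struct orphan_block *b, uint32_t ino)
{
    int old = atomic_load(&b->free_slots);

    do {
        if (old == 0)
            return false;      /* block full: caller moves to the next */
    } while (!atomic_compare_exchange_weak(&b->free_slots, &old, old - 1));

    for (size_t i = 0; ; i = (i + 1) % SLOTS_PER_BLOCK) {
        uint32_t expected = 0;
        if (atomic_compare_exchange_strong(&b->slot[i], &expected, ino))
            return true;       /* slot is ours; journal the block */
    }
}
```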

[Chart: Perf. With Orphan File (stress-orphan)]

[Chart: Perf. With Orphan File (reaim - fserver)]

Extent Status Tree Shrinker

Extent Status Tree
- A cache of logical-to-physical block mappings for files
- For each entry we have: logical offset, physical block, length, and extent type (delayed allocated, hole, unwritten, written)
- Organized as an RB-tree sorted by logical offset
- Under memory pressure we need to free some entries, so there is a shrinker
- Entries of the delayed-allocated type cannot be freed
- The MM subsystem calls the shrinker, asking it to scan N objects and free as many as possible
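The entries themselves are tiny, roughly along these lines (a simplified rendition of ext4's struct extent_status with the RB-tree linkage omitted):

```c
#include <stdint.h>

/* About 16 bytes of useful data per entry, as the slide says. In ext4
 * the extent type (written/unwritten/delayed/hole) is packed into the
 * upper bits of the physical block field; this is a simplified view. */
struct extent_status {
    uint32_t es_lblk;   /* first logical block covered */
    uint32_t es_len;    /* number of blocks            */
    uint64_t es_pblk;   /* physical block + type flags in the high bits */
};
```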

Extent Status Tree Shrinker
- We need to find entries to reclaim without bloating each entry too much; each entry has only 16 bytes of useful data
- Keep, in each inode, a timestamp of when a reclaimable extent was last used
- On a reclaim request, scan the RB-trees of the inodes with the oldest timestamps until N entries have been reclaimed
- This needs to sort the list of inodes (oops)
- Scanning may have to skip lots of unreclaimable extents, causing large stalls (watchdog triggered) on machines with lots of active inodes

Improving the Shrinker
- Joint work with Zheng Liu
- Walk inodes in round-robin order instead of LRU, so no list sorting is needed
- Remember where we stopped scanning an inode's RB-tree and start the next scan there
- It is enough to scan (not reclaim) N entries
- Add simple aging via an extent REFERENCED bit (see the sketch below)
- Merged in 3.19
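A compact sketch of the resulting scan loop, modelling each inode's extent cache as a circular array instead of an RB-tree; the names and structure are illustrative only, not the merged ext4 code.

```c
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 128

/* One cached extent; 'reclaimable' is false for delayed-allocated
 * entries, which the shrinker must never free. */
struct es_entry {
    bool present;
    bool reclaimable;
    bool referenced;     /* aging bit: set on lookup, cleared by scan */
};

/* Per-inode extent cache, modelled as a circular array. 'scan_pos'
 * remembers where the previous shrinker pass stopped, so the next
 * pass resumes there instead of rescanning from the start. */
struct inode_cache {
    struct es_entry es[CACHE_SLOTS];
    size_t scan_pos;
};

/* Scan (not necessarily reclaim) up to nr_to_scan entries of one inode.
 * A caller walks the inodes themselves in round-robin order, so no
 * timestamp sorting is needed. */
static size_t es_shrink_inode(struct inode_cache *ic, size_t nr_to_scan)
{
    size_t scanned;

    for (scanned = 0; scanned < nr_to_scan; scanned++) {
        struct es_entry *e = &ic->es[ic->scan_pos];

        ic->scan_pos = (ic->scan_pos + 1) % CACHE_SLOTS;
        if (!e->present || !e->reclaimable)
            continue;                 /* skipped entries still count */
        if (e->referenced)
            e->referenced = false;    /* spare it this round */
        else
            e->present = false;       /* reclaim the entry */
    }
    return scanned;
}
```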

Shrinker Tests
- Create 32 KB sparse files from 16 processes
- Show the maximum latency of a shrinker call over a 5-minute run
- A large memory allocation once per minute triggers reclaim
- Test run on an 8-CPU machine with 4 GB RAM, filesystem on a standard SATA drive

[Chart: Test With Lots of Files]

Test With Two 1 GB Files
- A single process creates two 1 GB sparse files
- A large allocation triggers reclaim
- Maximum latency reduced from 63132 us to 100 us

Checkpoint List Scanning

Checkpoint List Processing
- Joint work with Yuanhan Liu
- JBD2 tracks, with each transaction, all buffers that need checkpointing
- The list is cleaned up on transaction commit and when running out of journal space
- Expensive for loads with lots of small transactions
- Profile of fs_mark creating 100000 files, 4 KB each, with fsync:

  44.08%  jbd2/ram0-8  [kernel.kallsyms]  0xffffffff81017fe9
  23.76%  jbd2/ram0-8  [jbd2]             journal_clean_one_cp_list
  17.22%  jbd2/ram0-8  [jbd2]             jbd2_journal_clean_checkpoint

- The list is scanned repeatedly without cleaning anything up

Checkpoint List Processing Speedup
- There is no point in scanning further once we find a buffer that cannot be cleaned up yet
- This reduces the cleanup time from O(N²) to O(N); see the sketch below
- Merged for 3.18
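The change is essentially an early break in the list walk. A minimal sketch (simplified, not the actual journal_clean_one_cp_list()):

```c
#include <stdbool.h>
#include <stddef.h>

struct cp_buffer {
    bool dirty;                /* still needs writeback before cleanup */
    struct cp_buffer *next;
};

/* Remove leading clean buffers from a transaction's checkpoint list.
 * Buffers become cleanable roughly in list order, so once one dirty
 * buffer is hit everything behind it is dirty too: stopping here is
 * what turns the repeated O(N^2) rescans into a single O(N) pass. */
static int clean_cp_list(struct cp_buffer **head)
{
    int cleaned = 0;

    while (*head != NULL) {
        if ((*head)->dirty)
            break;             /* the old code kept scanning past this */
        *head = (*head)->next; /* drop the clean buffer from the list */
        cleaned++;
    }
    return cleaned;
}
```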

Conclusion

Takeaways
- Even if you cannot fundamentally improve scalability, just reducing the length of the critical section can help a lot
- But doing things locklessly is still faster
- It is a balance between sophisticated-but-slow and simple-and-fast

Thank you