XFS In Rapid Development Jeff Liu <jeff.liu@oracle.com>
We have had many requests to provide a supported option for the XFS file system on Oracle Linux... -- Oracle Linux Blog Feb 28, 2013 2
Overview Introduction & Background XFS Development Community How Fast XFS Is Going - Kernel - User Space Program - XFS Test Suite Kernel Changes (> upstream 3.0) Upcoming Features 3
Introduction & Background 4
XFS Development Community Developers from corporations - SGI, Red Hat, Oracle, SUSE, IBM Maintainer: Ben Myers @SGI Main contributors: Dave Chinner, Christoph Hellwig - Prominent individual contributors: Ben Myers, Brian Foster, Carlos Maiolino, Chandra Seetharaman, Eric Sandeen, Jan Kara, Jeff Liu, Mark Tinguely - Listed in alphabetical order 5
How Fast XFS Is Going
Statistics of code changes between Linux v3.0 and v3.10-rc1 (Jul 21 2011 - May 11 2013): Btrfs vs Ext4 with JBD2 vs XFS
$ git diff --stat --minimal -C -M v3.0..v3.10-rc1 -- fs/btrfs  # likewise for fs/ext4 fs/jbd2 and fs/xfs
(Chart: number of files changed, insertions and deletions for Ext4&JBD2, XFS and Btrfs, Linux v3.0 ~ v3.10-rc1) 6
How Fast XFS Is Going
Xfsprogs v3.1.6 ~ v3.1.11 (Oct 11 2011 ~ May 09 2013) - 15 contributors - 106 patches
$ git diff --stat --minimal -C -M v3.1.6 v3.1.11 | grep changed
108 files changed, 11113 insertions(+), 11418 deletions(-) 7
How Fast XFS Is Going
XFS Test Suite (aka xfstests) - Has become the generic test tool for local Linux kernel file systems - 170+ XFS-specific test cases as of May 16 2013 - Test cases are refactored into a well-organized hierarchy
$ ls -l xfstests/tests/
btrfs/ ext4/ generic/ Makefile shared/ udf/ xfs/ 8
Speedup Direct-IO R/W on high IOPS devices
FIO scenario (fio version 2.1) - Storage formatted with default options
- direct=1 rw=randrw bs=4k size=10g
- numjobs=10 # [20, 40, 80]
- runtime=120 thread ioengine=psync
Simplified output of xfs_info(8):
meta-data: isize=256 agcount=4 agsize=937408 blks sectsz=512
data: bsize=4096 blocks=3749632 sunit=0 swidth=0 blks
log: internal bsize=4096 blocks=2560 version=2 9
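The parameters above can be collected into a fio job file; the file and job names below are hypothetical, the option values mirror the slide:

```ini
; randrw.fio - a sketch of the benchmark scenario above
[global]
direct=1
rw=randrw
bs=4k
size=10g
runtime=120
thread
ioengine=psync

[randrw-job]
; the slide also repeats the run with 20, 40 and 80 jobs
numjobs=10
```

Run with `fio randrw.fio` against a file on the XFS mount under test.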
Speedup Direct-IO R/W on high IOPS devices
XFS write IOPS, SATA3 SSD: vanilla 3.7.0 vs 2.6.39 in delaylog mode
(Chart: input/output operations per second at 10, 20, 40 and 80 threads, 2.6.39 vs 3.7.0) 11
Speedup Direct-IO R/W on high IOPS devices Improvements behind the scenes - Check the page cache state under a shared IO lock - No exclusive locking for the normal direct IO case - Do not serialize direct IO reads on page cache checks - Do not serialize adjacent concurrent direct IO appending writes 12
Fsync(2) & Sync(2) story The xfssyncd workqueue was removed - Moved the xfs_sync source to xfs_icache - New dedicated workqueue (xfs_reclaim) for inode reclaim - New dedicated workqueue (xfs_log) for log work - The periodic sync work is now log work only, driven by the xfssyncd_centisecs sysctl 14
Improve sparse file handling SEEK_DATA/SEEK_HOLE options to lseek(2) - Derived from Solaris ZFS - Neater call interface than the FIEMAP ioctl(2) - Refinement for unwritten extents - Provides more efficient sparse file detection and backup Further reading: http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/sparse-improvements-lpc-2012.pdf 15
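A minimal sketch of the interface from user space: create a file with a leading hole, then let SEEK_DATA skip the hole and SEEK_HOLE find the end of the written extent. Exact offsets depend on the file system's allocation granularity, and file systems without real hole tracking fall back to treating the whole file as data.

```python
import os
import tempfile

# Create a sparse file: a 1 MiB hole, 4 KiB of data, then another hole.
fd, path = tempfile.mkstemp()
os.lseek(fd, 1024 * 1024, os.SEEK_SET)
os.write(fd, b"x" * 4096)
os.ftruncate(fd, 4 * 1024 * 1024)

# SEEK_DATA jumps over the leading hole to the written extent;
# SEEK_HOLE from there finds where the data ends.
data_start = os.lseek(fd, 0, os.SEEK_DATA)
hole_start = os.lseek(fd, data_start, os.SEEK_HOLE)
print(data_start, hole_start)

os.close(fd)
os.unlink(path)
```

A backup tool can loop over SEEK_DATA/SEEK_HOLE pairs to copy only the allocated extents instead of reading every zero-filled block, which is the efficiency gain the slide refers to.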
Quota improvements Bad scalability when searching tens of thousands of in-memory dquots, why? - User/Group/Project dquots are stored in a global hash table shared between file systems - A hash table is at worst O(n) for search/insert/delete, while a radix tree is at worst O(k), bounded by the key width, for insertion and deletion 16
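To illustrate the O(k) bound, here is a toy radix tree indexed by 32-bit dquot IDs, 4 bits per level; this is a hypothetical sketch, not the kernel's implementation. A lookup descends a fixed number of levels regardless of how many entries are stored, whereas a degenerate hash chain grows with the entry count.

```python
BITS_PER_LEVEL = 4
LEVELS = 32 // BITS_PER_LEVEL  # 8 levels cover a 32-bit key

class RadixTree:
    def __init__(self):
        self.root = {}

    def _slot(self, key, level):
        # Extract the 4-bit slot index for this level, most significant first.
        return (key >> (BITS_PER_LEVEL * (LEVELS - 1 - level))) & 0xF

    def insert(self, key, value):
        node = self.root
        for level in range(LEVELS - 1):
            node = node.setdefault(self._slot(key, level), {})
        node[self._slot(key, LEVELS - 1)] = value

    def lookup(self, key):
        # At most LEVELS dictionary probes, independent of tree population.
        node = self.root
        for level in range(LEVELS - 1):
            node = node.get(self._slot(key, level))
            if node is None:
                return None
        return node.get(self._slot(key, LEVELS - 1))

tree = RadixTree()
tree.insert(1001, "user dquot 1001")
tree.insert(1002, "user dquot 1002")
print(tree.lookup(1001))
```

The per-filesystem radix trees on the next slide apply the same idea: lookup cost stays bounded by the dquot ID width instead of degrading as more dquots are cached.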
Quota improvements Better in-memory quota caching and scalability - Replace the global hash table with a per-filesystem radix tree - Replace global dquot LRU lists with per-filesystem lists - Remove the global xfs_Gqm structure 17
New allocation workqueue - xfsalloc 8K stack space on x86_64 since Linux 2.6 - Extreme stack use in the Linux VM/VFS call chain - XFS is even worse in memory reclaim situations (aka writeback) - Buffer cache miss that triggers IO vs CPU cache miss Alleviate stack usage in the allocation call chain - Move all allocations to a separate context - Workqueue with a completion 18
Speculative preallocation improvements Preserves holey files Limit speculative prealloc near ENOSPC thresholds Quota-driven speculative preallocation throttling Limit speculative prealloc size on sparse files A worker that periodically cleans up speculative preallocation 19
Bounds checking enabled XFS module Benefits of a bounds-checking-enabled kernel Weak points of CONFIG_XFS_DEBUG from a user perspective - Significant overhead in production environments - Changes the behavior of algorithms (such as allocation) to improve test coverage - Intentionally panics the machine on non-fatal errors by design Only advisable for debugging purposes 20
Bounds checking enabled XFS module Alternative: CONFIG_XFS_WARN support - Converts ASSERT() checks to WARN_ON(1) - Does not modify algorithms - Does not panic the kernel on non-fatal errors - Allows finding strange "out of bounds" problems more easily - Already turned on in Fedora kernel-debug packages - Suggested for other Linux distributions with XFS support 21
Misc changes The freeze-induced XFS hang in Linux 3.0 - Fixed by converting to the new VFS freezing mechanism in Linux 3.5 Mount options - The nodelaylog option is gone; delaylog mode is the default (>= Linux 3.3) - Inode64 is re-mountable - Inode32 is re-mountable Native support for discontiguous buffers - Virtually contiguous in the buffers, non-contiguous on disk 22
The largest scalability problem facing XFS -- Self Describing Metadata preview 23
Self Describing Metadata preview The XFS filesystem is a journaling file system known for high performance and scalability. Yep, indeed! - Full 64-bit addressing - Scalable structures and algorithms But the verification of the file system structure... :( 24
Self Describing Metadata preview Forensic analysis of the file system structure via xfs_repair(8)/xfs_db(8) Analyzing the structure of 100TB to 1PB storage is the primary concern for supporting PB-scale file systems - Minimize the time and effort required for basic forensic analysis of the file system structure 25
Then What? Start your journey to XFS for fun and profit by # mkfs.xfs /your_storage 26
References http://xfs.org/index.php/XFS_status_update_for_2011 http://xfs.org/index.php/XFS_status_update_for_2012 http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html http://oss.sgi.com/archives/xfs/2013-04/msg00100.html http://lwn.net/Articles/476267/ http://lwn.net/Articles/476263/ http://lwn.net/Articles/84583/ http://en.wikipedia.org/wiki/Hash_table 27
Acknowledgments Thanks to the following people, in alphabetical order, for reviewing this document and for their helpful comments: Ben Myers, Dave Chinner, Eric Sandeen, Mark Tinguely I would also like to thank Christoph Hellwig for posting XFS development status updates for every official Linux release between 2011 and 2012; those updates saved me a lot of time in tracking the progress of that period. 28
Questions & Flames 29