The Btrfs Filesystem
Chris Mason
Btrfs Design Goals
- Broad development community
- General purpose filesystem that scales to very large storage
  - Extents for large files
  - Small files packed in as metadata
- Flexible disk format that can adapt to new features
  - Btree indexes based on extensible key/value lookups
  - Key ordering determines relative location in the btree
- Data and metadata checksumming
  - CRC32C used for fast, hardware-enabled CRCs
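To make the checksumming goal concrete, here is a minimal software sketch of the standard CRC-32C (Castagnoli) checksum and a block verification helper. The function names and the table-driven approach are illustrative assumptions, not Btrfs code, and the seed and finalization follow the generic CRC-32C definition rather than any particular on-disk convention. Hardware-enabled CRCs compute the same polynomial with dedicated instructions.

```python
def _make_crc32c_table():
    """Build the 256-entry lookup table for the reflected Castagnoli polynomial."""
    poly = 0x82F63B78
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the standard CRC-32C of data, optionally continuing from crc."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def verify_block(block: bytes, stored: int) -> bool:
    """Hypothetical helper: compare a block against its stored checksum."""
    return crc32c(block) == stored
```

A reader would compute the checksum when a block is written, store it alongside the block pointer, and call something like verify_block on every read.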
Btrfs Design Goals
- Data and metadata copy on write
  - Block contents preserved until replacement is safely on disk
- Data and metadata reference counting with back references
  - Every block and filename links back to its owners
- Fast, writable snapshots
  - COW enables O(1) snapshots of subvolumes
  - O(number of extents in the file) snapshots of single files
- Efficient detection of recently modified files
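The O(1) snapshot claim follows from copy on write: a snapshot only needs a second pointer to the same root, and later writes copy the path they touch. The toy model below sketches that idea in memory. The Node and Subvolume classes are hypothetical, and Python's garbage collector stands in for the on-disk reference counting, so this is an illustration of the principle rather than the Btrfs implementation.

```python
class Node:
    """An immutable-by-convention tree node; never modified after creation."""

    def __init__(self, children=None, data=None):
        self.children = dict(children or {})
        self.data = data


def cow_set(node, path, data):
    """Return a new node with data stored at path; the old node is untouched."""
    if not path:
        return Node(node.children if node else None, data)
    head, rest = path[0], path[1:]
    children = dict(node.children) if node else {}
    # Only the child on the write path is replaced; siblings stay shared.
    children[head] = cow_set(children.get(head), rest, data)
    return Node(children, node.data if node else None)


class Subvolume:
    def __init__(self, root=None):
        self.root = root or Node()

    def snapshot(self):
        # O(1): the snapshot shares the current root instead of copying it.
        return Subvolume(self.root)

    def write(self, path, data):
        self.root = cow_set(self.root, path.split("/"), data)

    def read(self, path):
        node = self.root
        for name in path.split("/"):
            node = node.children[name]
        return node.data
```

After a snapshot, writing to either side copies only the nodes between the root and the modified item, so the snapshot stays writable and unchanged subtrees remain shared.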
Btrfs Design Goals
- Simple, online disk administration
  - btrfs dev add /dev/xxx /mnt
  - btrfs dev delete /dev/xxx /mnt
  - btrfs filesystem resize XX /mnt
    - Can also resize a single device
  - btrfs filesystem balance /mnt
- Multiple device support
  - Flexible relocation of space
  - Easily find good copies when CRCs fail
- Efficient synchronous operations that do not stall the rest of the filesystem
- These goals have been met!
Snapshots and Subvolumes
- Subvolume is the unit of snapshotting
- Snapshots are very efficient, even when many are in place against the same source
- Individual files may be cloned without a full snapshot
  - Cloning support now in cp --reflink
- Subvolumes and snapshots may be created anywhere
- Subvolumes are roughly as expensive as directories
  - But you may not rename or hardlink files between subvolumes
- Snapshots can be written and snapshotted again
Snapshot Rollback
- The snapshot or subvolume used as the root of the filesystem can be specified
  - btrfs subvol list to find subvolumes
  - btrfs subvolume set-default to set a new default
- Allows you to snapshot before upgrading and roll back if things don't work well
Current Work In Progress
- Fsck with repair
  - Initially fs rescue
- Robust error handling
- RAID5/6
  - Reuse MD's parity calculation code
  - Single stripe size; adapt allocator and FS writeback to send down full stripes
- SSD front end cache
- Locking bottlenecks
SSD Optimizations
- Really just turning off rotational optimizations
- Send IO to the device right away
  - No stalling or waiting to collect more IO
- Don't avoid fragmentation
  - Send large writes whenever possible
- Reuse blocks instead of spreading across the device
  - Unless you're on a cheap SSD
- Send discards down in large batches
  - Collected in bulk and sent down right after transaction commit
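The discard batching described above can be sketched as follows: freed ranges are queued during a transaction, then sorted, merged where adjacent or overlapping, and sent down only at commit. The DiscardBatcher class and its send_discard callback are hypothetical names for illustration and do not reflect the kernel's data structures.

```python
class DiscardBatcher:
    """Collect freed (start, length) ranges and issue merged discards at commit."""

    def __init__(self, send_discard):
        self.pending = []
        self.send_discard = send_discard  # callback taking (start, length)

    def free(self, start, length):
        # Nothing is sent to the device yet; the range is only remembered.
        self.pending.append((start, length))

    def commit(self):
        merged = []
        for start, length in sorted(self.pending):
            if merged and start <= merged[-1][0] + merged[-1][1]:
                # Adjacent or overlapping: extend the previous range.
                prev_start, prev_len = merged[-1]
                end = max(prev_start + prev_len, start + length)
                merged[-1] = (prev_start, end - prev_start)
            else:
                merged.append((start, length))
        self.pending = []
        for start, length in merged:
            self.send_discard(start, length)
        return merged
```

Merging before sending means the device sees a few large discards rather than many small ones, which matches the "large batches" goal on the slide.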
Why Discard/Trim
SSD Front End Cache
- Stage writes to a set of fast SSD devices
- Remapping layer to remember which blocks are up to date on the SSD
- Push frequently read extents into the SSD as well
- Hot data will stay on the SSD without hitting spinning disks
- Work in progress, slightly different from IBM's experiments over the summer
Thin Provisioning
- Btrfs storage chunks are well suited to thin provisioning
  - Btrfs can return large chunks of storage back to the array
  - Btrfs can quickly expand the FS
- Discard support in Btrfs sends information about unused blocks down to the storage at run time
- FITRIM ioctl support is important for thin provisioning
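For readers unfamiliar with FITRIM, the sketch below shows roughly how user space asks a mounted filesystem to discard its unused blocks. It assumes Linux, the struct fstrim_range layout from linux/fs.h (three 64-bit fields: start, len, minlen), and the _IOWR encoding used on x86 and ARM; other architectures encode ioctl numbers differently. The fstrim function name is an illustrative choice, and the call needs sufficient privileges and a filesystem that supports discard.

```python
import fcntl
import os
import struct

# struct fstrim_range { __u64 start; __u64 len; __u64 minlen; }
FSTRIM_RANGE = struct.Struct("=QQQ")


def _iowr(type_char, nr, size):
    """Encode a read/write ioctl number (layout used on x86 and ARM Linux)."""
    return (3 << 30) | (size << 16) | (ord(type_char) << 8) | nr


# FITRIM is defined as _IOWR('X', 121, struct fstrim_range).
FITRIM = _iowr("X", 121, FSTRIM_RANGE.size)


def fstrim(mountpoint, start=0, length=2**64 - 1, minlen=0):
    """Ask the filesystem at mountpoint to discard free space; return bytes trimmed."""
    buf = bytearray(FSTRIM_RANGE.pack(start, length, minlen))
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        # The kernel writes the number of bytes trimmed back into the len field.
        fcntl.ioctl(fd, FITRIM, buf)
    finally:
        os.close(fd)
    return FSTRIM_RANGE.unpack(buf)[1]
```

On a thin-provisioned array, running this periodically lets the array reclaim space the filesystem no longer uses.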
Atomic Writes for Applications
- COW writes to Btrfs can be atomic up to large sizes
- Some hardware supports fast atomic writes of larger IOs as well
- Work in progress to wire up Btrfs atomic write support and use optimizations from the hardware
- We may also support linked atomic writes between two or more files
Database Write Performance
- Poor random write performance in COW mode
- Large files tend to fragment badly, leading to huge amounts of metadata and seeking
- New data from random writes can be collected in bulk after transaction commit and copied back to the original location
- Work in progress
Finding Recent Modifications
- btrfs subvol find-new
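One way to see why this search can be cheap: with copy on write, any change to an item rewrites every node on the path up to the root, so a subtree whose top node carries an old generation number cannot contain newer items and can be skipped. The sketch below models that pruning with a hypothetical Item class and find_new function. It assumes generation numbers are stored per node and is not the actual tree search code.

```python
class Item:
    def __init__(self, name, generation, children=()):
        self.name = name
        self.generation = generation  # transaction that last rewrote this node
        self.children = list(children)


def find_new(node, last_gen, path=""):
    """Yield paths of leaf items changed after last_gen, pruning old subtrees."""
    if node.generation <= last_gen:
        return  # nothing below an unchanged node can be newer
    full = f"{path}/{node.name}" if path else node.name
    if not node.children:
        yield full
    for child in node.children:
        yield from find_new(child, last_gen, full)
```

The cost is proportional to the amount of changed metadata rather than the size of the whole subvolume.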
Btrfs Scrubbing
- Scrubbing finds and repairs bad data
  - Read all the allocated extents
  - Verify checksums
  - Replace bad copies with correct mirror
- Work in progress, initial implementation working
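The three scrub steps above can be sketched as a loop over extents and their mirrored copies. This is a simplified in-memory model with hypothetical names (scrub, extents, mirrors), and it uses zlib.crc32 as a stand-in checksum because CRC32C is not in the Python standard library.

```python
import zlib


def checksum(data):
    # Stand-in: plain CRC-32, not the CRC32C that Btrfs uses.
    return zlib.crc32(data)


def scrub(extents, mirrors):
    """Verify every extent on every mirror and repair bad copies.

    extents: {extent_id: expected checksum}
    mirrors: list of {extent_id: bytes}, one dict per copy
    Returns (repaired, unrecoverable), where repaired holds
    (extent_id, mirror_index) pairs.
    """
    repaired, unrecoverable = [], []
    for ext_id, expected in extents.items():
        # Find any copy whose contents match the stored checksum.
        good = next(
            (m[ext_id] for m in mirrors if checksum(m[ext_id]) == expected),
            None,
        )
        if good is None:
            unrecoverable.append(ext_id)
            continue
        for idx, m in enumerate(mirrors):
            if checksum(m[ext_id]) != expected:
                m[ext_id] = good  # overwrite the bad copy with the good one
                repaired.append((ext_id, idx))
    return repaired, unrecoverable
```

If no copy of an extent passes verification, the sketch reports it as unrecoverable instead of guessing which copy is right.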
Conclusions
- Many things working and stable
- Focused on stability and performance
- http://btrfs.wiki.kernel.org/
- chris.mason@oracle.com