ZEA, A Data Management Approach for SMR

Adam Manzanares
Co-Authors
- Western Digital Research: Cyril Guyot, Damien Le Moal, Zvonimir Bandic
- University of California, Santa Cruz: Noah Watkins, Carlos Maltzahn

© 2016 Western Digital Corporation or its affiliates. All rights reserved.
Why SMR?
- HDDs are not going away
  - Exponential growth of data still exists
  - $/TB is still much lower than flash, and we want to continue this trend
- Traditional recording (PMR) is reaching its scalability limits
- SMR is a density-enhancing technology being shipped right now
- Future recording technologies (e.g., HAMR) may behave like SMR, with similar write constraints
Flavors of SMR
- SMR constraint: writes must be sequential to avoid data loss
- Drive Managed
  - Transparent to the user
  - Comes at the cost of predictability and additional drive resources
- Host Aware
  - Host is aware of the SMR working model
  - If the user does something wrong, the drive fixes the problem internally
- Host Managed
  - Host is aware of the SMR working model
  - If the user does something wrong, the drive rejects the IO request
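To make the write constraint concrete, here is a minimal sketch of the rule a host managed drive enforces on every write to a sequential zone; the types below are illustrative, not the actual ZBC structures:

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative zone state (hypothetical struct, not from ZBC/libzbc). */
    struct zone {
        uint64_t start;   /* first LBA of the zone */
        uint64_t length;  /* zone size in blocks */
        uint64_t wp;      /* current write pointer (absolute LBA) */
    };

    /* Host managed rule: a write to a sequential-required zone is valid
     * only if it begins exactly at the write pointer and stays in bounds. */
    static bool write_allowed(const struct zone *z, uint64_t lba, uint64_t nblk)
    {
        return lba == z->wp && lba + nblk <= z->start + z->length;
    }

    /* A host aware drive would accept lba != wp and remap internally (at a
     * cost in predictability); a host managed drive rejects the IO instead. */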
SMR Drive Device Model
- New SCSI standard: Zoned Block Commands (ZBC); SATA equivalent: ZAC
- The drive is described by zones and their restrictions:

    Type                  Write Restriction  Intended Use              Con                   Abbreviation
    Conventional          None               In-place updates          Increased resources   CZ
    Sequential Preferred  None               Mostly sequential writes  Variable performance  SPZ
    Sequential Required   Sequential writes  Only sequential writes                          SRZ

- Our user space library (libzbc) queries zone information from the drive
  https://github.com/hgst/libzbc
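For example, zone discovery with libzbc looks roughly like the following. This is a sketch against the circa-2016 libzbc interface; exact function and field names may differ between library versions:

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <libzbc/zbc.h>

    int main(int argc, char **argv)
    {
        struct zbc_device *dev;
        struct zbc_zone *zones = NULL;
        unsigned int i, nr_zones = 0;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <device>\n", argv[0]);
            return 1;
        }

        /* Open the ZBC/ZAC device, e.g. /dev/sdb */
        if (zbc_open(argv[1], O_RDONLY, &dev) != 0) {
            perror("zbc_open");
            return 1;
        }

        /* Report all zones starting from LBA 0; libzbc allocates the
         * zone array and the caller frees it. */
        if (zbc_list_zones(dev, 0, ZBC_RO_ALL, &zones, &nr_zones) == 0) {
            for (i = 0; i < nr_zones; i++)
                printf("zone %5u: start %llu len %llu wp %llu type 0x%x\n",
                       i,
                       (unsigned long long)zones[i].zbz_start,
                       (unsigned long long)zones[i].zbz_length,
                       (unsigned long long)zones[i].zbz_write_pointer,
                       (unsigned int)zones[i].zbz_type);
            free(zones);
        }

        zbc_close(dev);
        return 0;
    }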
Why Host Managed SMR?
- We are chasing predictability
  - Large systems with complex data flows demand relatively tight latency windows from their storage devices
  - These translate into latency guarantees from the larger system
- Drive Managed HDDs: when is GC triggered? How long does GC take?
- Host Aware HDDs: seem OK if you do the right thing, but degrade to drive managed behavior when you don't
- Host Managed: the complexity and variance of drive-internal schemes are gone
Host Managed SMR Seems Great, But...
- All of the layers in the storage stack must be SMR compatible
  - Any IO reordering causes a problem; it must not happen at any point between the user and the device
- What about my new SMR FS?
  - What is your GC policy? Is it tunable?
  - Does your FS introduce latency at its own discretion?
- What about my new KV store that is SMR compatible?
  - See above; in addition, is it built on top of a file system?

[Figure: The Linux Storage Stack Diagram, version 4.0 (kernel 4.0, 2015-06-01), by Werner Fischer and Georg Schönberger, CC-BY-SA 3.0, http://www.thomas-krenn.com/en/wiki/linux_storage_stack_diagram]
What Did We Do?
- Introduced ZEA, a zone and block based extent allocator
  - Write logical extent [zone block address (ZBA)]: returns the drive LBA
  - Read extent [logical block address]: returns data if the extent is valid
  - Iterators over extent metadata allow one to rebuild the ZBA -> LBA mapping

[Figure: ZEA + SMR drive, with per-write metadata]
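The bullets above suggest an interface along these lines. The following C header is a hypothetical reconstruction of the ZEA API from the slide, not the actual library's signatures:

    #include <stdint.h>
    #include <stddef.h>

    struct zea_allocator;        /* opaque allocator handle */

    typedef uint64_t zea_zba_t;  /* zone block address (logical) */
    typedef uint64_t zea_lba_t;  /* drive logical block address */

    /* Append an extent of len blocks: ZEA writes the data sequentially in
     * some zone, along with its per-write metadata, and returns the LBA. */
    int zea_write_extent(struct zea_allocator *zea, zea_zba_t zba,
                         const void *buf, size_t len, zea_lba_t *lba_out);

    /* Read an extent back by ZBA; fails if the extent is no longer valid. */
    int zea_read_extent(struct zea_allocator *zea, zea_zba_t zba,
                        void *buf, size_t len);

    /* Iterate over extent metadata, e.g. to rebuild the ZBA -> LBA map
     * after a restart, or to pick relocation victims during GC. */
    struct zea_extent_iter;
    struct zea_extent_info {
        zea_zba_t zba;
        zea_lba_t lba;
        size_t    len;
        int       valid;
    };
    int  zea_iter_open(struct zea_allocator *zea, struct zea_extent_iter **it);
    int  zea_iter_next(struct zea_extent_iter *it, struct zea_extent_info *info);
    void zea_iter_close(struct zea_extent_iter *it);

Because the application addresses extents by ZBA rather than by physical location, ZEA is free to relocate extents during garbage collection without the application noticing.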
ZEA Performance vs. Block Device
ZEA Is Just an Extent Allocator, What Next?
- ZEA + LevelDB
  - LevelDB is a widely used KV store library
  - The LevelDB backend API is a good fit for SMR: write files append-only, read randomly, create and delete files, flat directory
  - Let's build a simple FS compatible with LevelDB (sketched below)

[Figure: ZEA + LevelDB architecture]
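A sketch of what such a simple FS might expose, with hypothetical names rather than the actual implementation: only the operations LevelDB's backend needs, each mapping naturally onto ZEA extents:

    /* Hypothetical minimal FS over ZEA, covering only what LevelDB's
     * backend requires: append-only writes, random reads, create and
     * delete, and a flat directory namespace. */
    #include <stdint.h>
    #include <stddef.h>

    struct zeafs;        /* mount handle, backed by a ZEA allocator */
    struct zeafs_file;   /* open file handle */

    int zeafs_create(struct zeafs *fs, const char *name, struct zeafs_file **f);
    int zeafs_open(struct zeafs *fs, const char *name, struct zeafs_file **f);

    /* Writes only append: each append becomes one or more ZEA extents,
     * so the drive only ever sees sequential writes within a zone. */
    int zeafs_append(struct zeafs_file *f, const void *buf, size_t len);

    /* Reads are random: file offset -> extent lookup via the ZBA map. */
    int zeafs_read(struct zeafs_file *f, uint64_t off, void *buf, size_t len);

    /* Delete frees the file's extents, leaving garbage for later GC. */
    int zeafs_delete(struct zeafs *fs, const char *name);

    /* Flat namespace: enumerate file names via a callback. */
    int zeafs_list(struct zeafs *fs,
                   int (*cb)(const char *name, void *arg), void *arg);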
ZEA + LevelDB Performance
Lessons Learned
- ZEA is a lightweight abstraction
  - Hides the sequential write constraint from the application
  - Low overhead vs. a block device when the extent size is reasonable
  - Provides addressing flexibility to the application, which is useful for GC
- LevelDB integration opens up the usability of Host Managed SMR HDDs
  - Unfortunately, LevelDB is not so great for large objects
  - The ideal use case for an SMR drive would be large static blobs
What Is Left?
- What is a good interface above ZEA?
- Garbage collection policies: when and how?
- How to use multiple zones efficiently
  - Allocation
  - Garbage collection

Thanks for listening!
adam.manzanares@hgst.com