Linux SMR Support Status


Damien Le Moal
Vault Linux Storage and Filesystems Conference 2017, March 23rd, 2017

Outline
- Standards and kernel support status
- Kernel details: what was needed
  - Block stack
  - File systems
  - Application level tools
- Ongoing work: device mapper and file systems
- Performance evaluation results
  - dbench
  - Sustained write workloads

Standards and Kernel Support Status
- The ZBC and ZAC specifications are stable
  - The ZAC and ZBC r05 specifications have been forwarded to INCITS for publication
  - The latest SAT-4 r06 specification includes a description of zone block command translation
- Kernel support was finalized with the addition of the ZBC commands in kernel 4.10
  - Host-managed disks are exposed as regular block devices
  - Full support for ZBC to ZAC command translation
- Timeline: TYPE_ZBC SCSI device type defined in kernel 3.17 and TYPE_ZAC ATA device signature defined in kernel 3.18 (2014); ZAC support released in kernel 4.8 (2016); ZBC support released in kernel 4.10 (2017)

What Was Needed?
All constraints come from host-managed disks; SMR comes in different flavors:
- Drive-managed: the disk presents itself as a regular block device. No constraints, no kernel requirements, and no SMR-specific optimization possible.
- Host-aware: the disk presents itself as a regular block device. No constraints; SMR-specific optimizations are possible, requiring support for the ZBC/ZAC commands.
- Host-managed: not a regular block device (different device type) and subject to the sequential write constraint. Requires support for the ZBC/ZAC commands plus sequential write enforcement, or at least sequential write guarantees.

Block Stack: No Real Change, Only Additions
- Internal API for report zones and reset write pointer
- No in-kernel support for open zone, close zone and finish zone
  - Any zone command is allowed in pass-through mode with SG_IO (fully functional SAT layer)
- Some limits are imposed on disk zone configurations:
  - All zones must have the same size, except for an eventual smaller last (runt) zone
  - The zone size must be a power-of-two number of LBAs (see the sketch below)
  - Reads are unrestricted (URSWRZ bit set to 1)
- Host-managed disks are seen as almost-regular block devices
- The sequential write constraint is exposed to the drive user: file systems, device mappers and applications must ensure sequential write patterns to sequential zones
- Write ordering guarantees are implemented by limiting the write queue depth to 1 per zone
  - This also solves many HBA-level command ordering problems, including with AHCI
[Figure: storage stack, from the application at user level down through the VFS, file system, block layer with block I/O scheduler, SCSI/ATA stack and HBA driver]
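Because all zones have the same size and that size is a power of two, zone bookkeeping for a disk user is simple arithmetic. A minimal sketch, assuming 256 MB zones of 524288 512-byte sectors (the chunk_sectors value shown on the next slide); the program and its values are illustrative only:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* 256 MB zones: 524288 sectors of 512 bytes (matches chunk_sectors) */
	uint64_t zone_sectors = 524288;
	uint64_t sector = 1234567890;	/* arbitrary example LBA */

	uint64_t zone_no = sector / zone_sectors;
	uint64_t zone_start = zone_no * zone_sectors;
	/* The mask works because the zone size is a power of two */
	uint64_t offset = sector & (zone_sectors - 1);

	printf("sector %llu: zone %llu, zone start %llu, offset in zone %llu\n",
	       (unsigned long long)sector, (unsigned long long)zone_no,
	       (unsigned long long)zone_start, (unsigned long long)offset);
	return 0;
}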

Block Stack: Device Compliance Checked on Boot
- Constraint compliance, zone size and device type are checked at boot time and on device revalidate:

[    3.687797] scsi 5:0:0:0: Direct-Access-ZBC ATA HGST HSH721414AL TE8C PQ: 0 ANSI: 7
[    3.696359] sd 5:0:0:0: Attached scsi generic sg4 type 20
[    3.696485] sd 5:0:0:0: [sdd] Host-managed zoned block device
[    3.865072] sd 5:0:0:0: [sdd] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
[    3.873046] sd 5:0:0:0: [sdd] 4096-byte physical blocks
[    3.878343] sd 5:0:0:0: [sdd] 52156 zones of 524288 logical blocks
[    3.884591] sd 5:0:0:0: [sdd] Write Protect is off
[    3.889440] sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[    3.889458] sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    4.253140] sd 5:0:0:0: [sdd] Attached SCSI disk

- Information is available to applications through sysfs files:
  - The zoned file for the device type: host-managed, host-aware or none
  - The chunk_sectors file for the zone size

> cat /sys/block/sdd/queue/zoned
host-managed
> cat /sys/block/sdd/queue/chunk_sectors
524288
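A minimal user-space sketch of reading these two sysfs attributes from a program; the device name sdd matches the example output above and is otherwise an assumption:

#include <stdio.h>

int main(void)
{
	char model[32] = "none";
	unsigned long zone_sectors = 0;
	FILE *f;

	/* Zoned model: "host-managed", "host-aware" or "none" */
	f = fopen("/sys/block/sdd/queue/zoned", "r");
	if (f) {
		if (fscanf(f, "%31s", model) != 1)
			model[0] = '\0';
		fclose(f);
	}

	/* Zone size in 512-byte sectors (0 if not reported) */
	f = fopen("/sys/block/sdd/queue/chunk_sectors", "r");
	if (f) {
		if (fscanf(f, "%lu", &zone_sectors) != 1)
			zone_sectors = 0;
		fclose(f);
	}

	printf("zoned: %s, zone size: %lu sectors (%lu MiB)\n",
	       model, zone_sectors, zone_sectors / 2048);
	return 0;
}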

Block Stack API: Internal Kernel Functions
- Zone information is reported as struct blk_zone:

> less include/uapi/linux/blkzoned.h
...
struct blk_zone {
	__u64 start;          /* Zone start sector */
	__u64 len;            /* Zone length in number of sectors */
	__u64 wp;             /* Zone write pointer position */
	__u8  type;           /* Zone type */
	__u8  cond;           /* Zone condition */
	__u8  non_seq;        /* Non-sequential write resources active */
	__u8  reset;          /* Reset write pointer recommended */
	__u8  reserved[36];
};

- Zone report and reset functions are provided, suitable for file systems or device mappers operating on a struct block_device (a usage sketch follows):

> less include/linux/blkdev.h
...
extern int blkdev_report_zones(struct block_device *bdev, sector_t sector,
			       struct blk_zone *zones, unsigned int *nr_zones,
			       gfp_t gfp_mask);
extern int blkdev_reset_zones(struct block_device *bdev, sector_t sectors,
			      sector_t nr_sectors, gfp_t gfp_mask);
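A minimal in-kernel sketch, not taken from the slides, of how a file system or device mapper might walk a device's zones with this API. It assumes the kernel 4.10 signatures quoted above; the function name example_scan_zones() and the report size of 128 zones are made up for the example:

#include <linux/blkdev.h>
#include <linux/genhd.h>
#include <linux/slab.h>

#define EXAMPLE_NR_ZONES 128	/* zones fetched per report call (arbitrary) */

static int example_scan_zones(struct block_device *bdev)
{
	sector_t capacity = get_capacity(bdev->bd_disk);
	sector_t sector = 0;
	struct blk_zone *zones;
	int ret = 0;

	zones = kcalloc(EXAMPLE_NR_ZONES, sizeof(*zones), GFP_KERNEL);
	if (!zones)
		return -ENOMEM;

	while (sector < capacity) {
		unsigned int nr_zones = EXAMPLE_NR_ZONES;
		unsigned int i;

		/* Fill the array with zone descriptors starting at "sector" */
		ret = blkdev_report_zones(bdev, sector, zones, &nr_zones,
					  GFP_KERNEL);
		if (ret || !nr_zones)
			break;

		for (i = 0; i < nr_zones; i++)
			pr_info("zone %llu: len %llu, wp %llu, type %u, cond %u\n",
				(unsigned long long)zones[i].start,
				(unsigned long long)zones[i].len,
				(unsigned long long)zones[i].wp,
				zones[i].type, zones[i].cond);

		/* Continue the report right after the last zone returned */
		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
	}

	kfree(zones);
	return ret;
}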

Block Stack API: ioctls Provided to Applications
- Zone report and reset are available to applications through ioctls
- The zone report ioctl interface uses the same struct blk_zone as the kernel

> less include/uapi/linux/blkzoned.h
...
/**
 * Zoned block device ioctl's:
 *
 * @BLKREPORTZONE: Get zone information. Takes a zone report as argument.
 *                 The zone report will start from the zone containing the
 *                 sector specified in the report request structure.
 * @BLKRESETZONE: Reset the write pointer of the zones in the specified
 *                sector range. The sector range must be zone aligned.
 */
#define BLKREPORTZONE	_IOWR(0x12, 130, struct blk_zone_report)
#define BLKRESETZONE	_IOW(0x12, 131, struct blk_zone_range)
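A minimal user-space sketch, not from the slides, exercising both ioctls. It assumes the struct blk_zone_report and struct blk_zone_range definitions from linux/blkzoned.h; /dev/sdd and the 32-zone report size are arbitrary example values, and the reset at the end is purely illustrative (it discards data and fails on a conventional zone):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

int main(void)
{
	int fd = open("/dev/sdd", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Report up to 32 zones, starting from the zone containing sector 0 */
	unsigned int nr = 32;
	struct blk_zone_report *rep =
		calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
	if (!rep)
		return 1;
	rep->sector = 0;
	rep->nr_zones = nr;

	if (ioctl(fd, BLKREPORTZONE, rep) < 0 || !rep->nr_zones) {
		perror("BLKREPORTZONE");
		return 1;
	}

	for (unsigned int i = 0; i < rep->nr_zones; i++)
		printf("zone %u: start %llu, len %llu, wp %llu, type %u\n",
		       i, (unsigned long long)rep->zones[i].start,
		       (unsigned long long)rep->zones[i].len,
		       (unsigned long long)rep->zones[i].wp,
		       rep->zones[i].type);

	/* Reset the write pointer of the first reported zone (illustrative) */
	struct blk_zone_range range = {
		.sector = rep->zones[0].start,
		.nr_sectors = rep->zones[0].len,
	};
	if (ioctl(fd, BLKRESETZONE, &range) < 0)
		perror("BLKRESETZONE");

	free(rep);
	close(fd);
	return 0;
}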

File Systems: Native Support for f2fs
- F2FS native support for zoned block devices is included in kernel 4.10
- It completely hides the sequential write constraint from the application level
- Added on top of the lfs mode of f2fs
  - Pure log-structured operation: no update-in-place optimization for metadata blocks
- Sections (groups of segments) are aligned to device zones
  - Checked by the mkfs.f2fs user application
  - Fixed-location metadata is placed in conventional zones
- Other problems fixed:
  - Sequential use of segments within sections
  - Atomic block allocation and I/O issuing operations (also benefits SSDs)
  - Discard granularity per section, triggering a zone reset
[Figure: storage stack, from the application at user level down through the VFS, f2fs, block layer with block I/O scheduler, SCSI/ATA stack and HBA driver]

Application Level Tools: Different Choices Available
- libzbc (https://github.com/hgst/libzbc)
  - Provides an API library for executing all defined ZBC and ZAC commands
  - Pass-through (SG_IO) or using the kernel block device and ioctls
- sg3_utils (http://sg.danny.cz/sg/sg3_utils.html)
  - Legacy, well-known SCSI command application tool chain
  - The current version, 1.42, supports ZBC: sg_rep_zones, sg_reset_wp and sg_zone tools are provided
  - Relies on a SAT layer, either the kernel SAT layer (libata) or the HBA SAT layer, for ZBC to ZAC command translation
- Linux sys-utils, part of the util-linux repository (https://github.com/karelzak/util-linux)
  - blkzone utility added, supporting zone report and zone reset
  - Implemented using the BLKREPORTZONE and BLKRESETZONE ioctls
[Figure: storage stack, from the application at user level down through the VFS, file system, block layer with block I/O scheduler, SCSI/ATA stack and HBA driver]

Ongoing Work: Device Mapper
- dm-zoned: exposes a zoned block device as a regular block device
  - Uses conventional zones as on-disk random-write buffers
  - Conventional zones are reclaimed during idle time or on demand
- Minimal capacity loss and resource usage
  - A few 256 MB zones on a 14 TB disk for internal metadata
  - 3 to 4 MB of memory per disk for metadata caching
- Patches submitted for review
- Possible extensions being discussed; at LSF/MM: merge with a compression target?
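A toy sketch of the random-write buffering idea described above; this is not dm-zoned code, just an illustration with made-up names and sizes, and with zone reclaim and metadata persistence omitted:

#include <stdint.h>
#include <stdio.h>

#define ZONE_BLOCKS 65536u                 /* 256 MB zone of 4 KB blocks */
#define NR_LOGICAL_BLOCKS (4u * ZONE_BLOCKS)

static uint32_t map[NR_LOGICAL_BLOCKS];    /* logical block -> buffer block */
static uint32_t buf_zone_start;            /* first block of the buffer zone */
static uint32_t buf_wp;                    /* write pointer in the buffer zone */

/* Redirect a random logical write to the next sequential buffer block */
static uint32_t buffer_write(uint32_t logical_block)
{
	uint32_t dst = buf_zone_start + buf_wp++;
	map[logical_block] = dst;
	return dst;                        /* caller issues the actual I/O here */
}

int main(void)
{
	/* Random-looking logical blocks become strictly sequential on disk */
	uint32_t blocks[] = { 42, 7, 100000, 42, 9 };
	for (unsigned int i = 0; i < 5; i++)
		printf("logical %6u -> physical %u\n",
		       blocks[i], buffer_write(blocks[i]));
	return 0;
}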

Ongoing Work: File Systems (BtrFS and XFS)
- BtrFS
  - Changes at format time: all super block copies placed in conventional zones, and 1 GB block groups aligned to device zones
  - Run-time fixes to ensure that 1 GB block groups are written sequentially, e.g. when multiple files are being written in the same block group:
    - Fixed block pre-allocation
    - Ensure that block allocation and I/O submission are atomic operations
    - Fixed ordering of writes with flush requests
  - Patches almost ready for review
- XFS
  - The introduction of the reverse block mapping b-tree makes it possible to implement ZBC support
  - Needed for metadata updates after random file block modification and for zone GC
  - Activity not started yet

Evaluation Results: dbench
- File systems vs. device mapper: compare software changes, not disk drives, so the same physical disk was used throughout
  - A regular 6 TB SAS disk with its firmware changed to add the ZBC interface and write constraints (host-managed disk model)
  - 22,000 zones of 256 MB; 1% of the zones are conventional (CMR), starting at LBA 0
- Regular dbench benchmark, no SMR-specific optimization
- 1 and 32 clients

dbench: No Significant Performance Penalty
- Under mixed read-write workloads, there is no significant performance degradation with SMR
- The device mapper can even improve performance, since random writes become sequential
- BtrFS and f2fs on SMR show slightly lower performance

Evaluation Results: Sustained Write Workload
- File systems vs. device mapper, on the same drives
- Simple application writing files of random size (1 to 4 MB; think cat pictures) to different directories
- 20 concurrent writer processes
- Bandwidth measured every 200 files written

Sustained Write Workload: ext4
- The overhead of dm-zoned reclaiming its write buffer zones is clearly visible
- Caused by too many random metadata updates

Sustained Write Workload: ext4 with Packed Metadata
- Better on a standard disk, but no significant improvement on dm-zoned
- Concurrent writes by different processes do not generate a sequential write sequence per zone

Sustained Write Workload: XFS
- The initial high-performance phase on dm-zoned lasts longer than with ext4
- But overall no significant improvement; the average is similar

Sustained Write Workload: f2fs
- SMR does not matter: performance is the same

Sustained Write Workload: f2fs
- Performance is sustained until the disk is full
- The only visible change is the OD-to-ID (outer to inner diameter) performance variation

Sustained Write Workload: f2fs Under GC
- GC can be very costly; here dm-zoned does better
- Measured as writes after random directory deletion and writes after random file deletion

Conclusions
- SMR constraints can be dealt with
- ZBC kernel support is simple
  - No intelligence to sequentialize writes, but flexible for file systems and device mappers
- The file system approach is better
  - More information to work with compared to device mappers
  - Better performance, e.g. f2fs
  - But more tuning is necessary
- More work coming online in the next few months
  - BtrFS, XFS
  - dm-zoned (with compression?)