Introduction to Open-Channel Solid State Drives and What's Next! Matias Bjørling, Director, Solid-State System Software. September 25th, 2018. Storage Developer Conference 2018, Santa Clara, CA
Forward-Looking Statements / Safe Harbor Disclaimer
This presentation contains forward-looking statements that involve risks and uncertainties, including, but not limited to, statements regarding our solid-state technologies, product development efforts, software development and potential contributions, growth opportunities, and demand and market trends. Forward-looking statements should not be read as a guarantee of future performance or results, and will not necessarily be accurate indications of the times at, or by, which such performance or results will be achieved, if at all. Forward-looking statements are subject to risks and uncertainties that could cause actual performance or results to differ materially from those expressed in or suggested by the forward-looking statements. Key risks and uncertainties include volatility in global economic conditions, business conditions and growth in the storage ecosystem, impact of competitive products and pricing, market acceptance and cost of commodity materials and specialized product components, actions by competitors, unexpected advances in competing technologies, difficulties or delays in manufacturing, and other risks and uncertainties listed in the company's filings with the Securities and Exchange Commission (the "SEC") and available on the SEC's website at www.sec.gov, including our most recently filed periodic report, to which your attention is directed. We do not undertake any obligation to publicly update or revise any forward-looking statement, whether as a result of new information, future developments or otherwise, except as required by law.
Agenda: 1. Motivation 2. Interface 3. Eco-system 4. What's Next? Standardization
4K Random Read Latency, 0% Writes. [Figure: read latency vs. I/O percentiles for a 4K random read / 4K random write workload with no writes.]
4K Random Read Latency, 20% Writes. [Figure: read latency vs. I/O percentiles for a 4K random read / 4K random write workload; significant outliers reach 4 ms, a worst case 30x higher than the read-only workload.]
NAND Chip Density Continues to Grow While Cost/GB Decreases. [Figure: 3D NAND layer counts across SLC/MLC/TLC/QLC generations, rising from 48 layers in 2015 to 64 in 2017 and 96 in 2018, with example workloads #1–#4 annotated.]
Ubiquitous Workloads. Cloud efficiency requires many different workloads — databases, sensors, analytics, virtualization, video — to share a single SSD.
Solid State Drive Internals
Host interface (NVMe): read/write. Media interface (controller to NAND dies): read/program/erase.
Highly parallel architecture: tens of NAND dies.
NAND access latencies: read 50–100 µs, program (write) 1–10 ms, erase 3–15 ms.
Translation layer: logical-to-physical (L2P) translation map, wear-leveling, garbage collection, bad-block management, media error handling, etc. — turning the media's read/program/erase into the host's read/write.
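The translation layer's core job can be illustrated with a minimal sketch (a toy model, not real firmware): NAND cannot be overwritten in place, so every write goes to a fresh physical page while the L2P map is updated and the old page is marked invalid — which is exactly the stale data that garbage collection must later reclaim. All names below are illustrative.

```python
# Minimal sketch of a flash translation layer (FTL): a logical-to-physical
# map plus out-of-place writes. Illustrative model, not actual SSD firmware.

class ToyFTL:
    def __init__(self):
        self.l2p = {}         # logical page number -> physical page number
        self.store = {}       # physical page number -> data
        self.invalid = set()  # stale physical pages (garbage-collection work)
        self.next_free = 0    # log-structured: always write the next free page

    def write(self, lpn, data):
        # NAND pages cannot be overwritten in place: write out-of-place
        # and invalidate the previous physical location.
        if lpn in self.l2p:
            self.invalid.add(self.l2p[lpn])
        self.store[self.next_free] = data
        self.l2p[lpn] = self.next_free
        self.next_free += 1

    def read(self, lpn):
        # Reads always go through the indirection table.
        return self.store[self.l2p[lpn]]

ftl = ToyFTL()
ftl.write(7, "v1")
ftl.write(7, "v2")              # rewrite: old physical page becomes stale
assert ftl.read(7) == "v2"
assert len(ftl.invalid) == 1    # one page now awaits garbage collection
```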
Single-User Workloads: indirection and indirect writes cause outliers.
Host side — log-on-log: (1) a log-structured database in user space (e.g., RocksDB) performs its own metadata management, address mapping, and garbage collection via pread/pwrite; (2) below it, a log-structured file system in the kernel (under VFS and the block layer) repeats the same metadata management, address mapping, and garbage collection; (3) the solid-state drive repeats all three a third time behind its read/write/trim interface.
Device side — indirect writes: the drive maps logical data to physical locations on a best-effort basis. Unable to align data logically, it incurs increased write amplification and extra GC, and the host remains oblivious to data placement due to the indirection.
Open-Channel SSDs: I/O isolation, predictable latency, and host-controlled data placement & I/O scheduling.
Solid State Drive Internals: Host Responsibility
The host takes over the logical-to-physical translation map, garbage collection, and logical wear-leveling (with hints for placing hot/cold data); the device retains bad-block management and media error handling.
Integration: the NVMe device driver exposes a block device through a host-side FTL that performs L2P mapping, GC, and logical wear-leveling, with overhead similar to traditional SSDs — so applications, databases, and file systems run unmodified.
Concepts in the Open-Channel SSD Interface
Chunks: sequential-write-only LBA ranges that align writes to internal block sizes.
Hierarchical addressing: a sparse addressing scheme projected onto the NVMe LBA address space.
Host-assisted media refresh: improves I/O predictability.
Host-assisted wear-leveling: improves wear-leveling.
Chunks #1: enable an orders-of-magnitude reduction of device-side DRAM.
A chunk is a range of LBAs where writes must be sequential, and rewriting a chunk requires a reset. This reduces DRAM for the L2P table by orders of magnitude and enables hot/cold data separation. A chunk is in one of four states (free/open/closed/offline); an open chunk has an associated write pointer. This is the same device model as the ZAC/ZBC standards, and a similar device model is to be standardized in NVMe (I'll come back to this). [Diagram: a namespace of Max LBAs divided into Chunk 0 … Chunk X.]
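The DRAM-reduction claim can be made concrete with back-of-the-envelope arithmetic (the drive size, chunk size, and entry sizes below are illustrative assumptions, not figures from the specification): a page-level FTL keeps an L2P entry per 4 KiB LBA, while a chunk-based device only keeps a few bytes of state per chunk.

```python
# Back-of-the-envelope comparison of L2P metadata: a per-4KiB mapping
# table vs. per-chunk write-pointer state. All sizes are assumptions
# chosen for illustration.

capacity = 4 * 2**40        # assumed 4 TiB drive
lba_size = 4096             # 4 KiB logical blocks
entry    = 4                # bytes per L2P entry (32-bit physical address)

# Classic SSD FTL: one entry per logical block.
page_level = (capacity // lba_size) * entry

chunk_size = 64 * 2**20     # assumed 64 MiB chunks
per_chunk  = 8              # assumed bytes of state per chunk (write pointer etc.)

# Chunk-based device: only per-chunk state is needed.
chunk_level = (capacity // chunk_size) * per_chunk

print(page_level // 2**20, "MiB vs", chunk_level // 2**10, "KiB")
# → 4096 MiB vs 512 KiB (roughly four orders of magnitude less DRAM)
```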
Chunks #2: the drive's LBA range is divided into chunks.
Chunk types: conventional (random or sequential writes allowed) and sequential-write-required, where the chunk must be written sequentially and must be reset entirely before being rewritten.
Each chunk tracks a write pointer position: write commands advance the write pointer, and reset commands rewind it.
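The chunk model on these two slides — four states, a write pointer that only moves forward, and a reset that rewinds it — can be sketched as a small state machine. This is a simplified sketch; the OCSSD 2.0 specification defines the authoritative state transitions.

```python
# Sketch of a sequential-write-required chunk: states free/open/closed/
# offline, a write pointer advanced by writes and rewound by reset.
# Simplified relative to the OCSSD 2.0 specification.

class Chunk:
    def __init__(self, nlb):
        self.nlb = nlb          # chunk size in logical blocks
        self.state = "free"
        self.wp = 0             # write pointer: next writable LBA offset

    def write(self, slba, count):
        # Writes must land exactly at the write pointer (sequential only).
        if self.state == "offline" or slba != self.wp:
            raise IOError("write not at write pointer")
        self.state = "open"
        self.wp += count
        if self.wp == self.nlb:
            self.state = "closed"   # fully written chunks transition to closed

    def reset(self):
        # Reset (erase) rewinds the write pointer; the chunk as a whole
        # must be reset before any LBA in it can be rewritten.
        if self.state == "offline":
            raise IOError("chunk is worn out")
        self.state = "free"
        self.wp = 0

c = Chunk(nlb=4)
c.write(0, 2)
c.write(2, 2)
assert c.state == "closed"
try:
    c.write(0, 1)               # rewrite without a reset is rejected
except IOError:
    pass
c.reset()
assert (c.state, c.wp) == ("free", 0)
```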
Hierarchical Addressing: channels and dies are mapped to logical groups and parallel units.
Device parallelism is exposed through groups and parallel units (PUs): one die, or a group of dies, is exposed as a parallel unit to the host. Parallel units are a logical representation: physical channels and dies are projected onto a logical address space of the form Group → PU → Chunk → LBA within an NVMe namespace.
OCSSD Host-assisted Media Refresh: enable the host to assist the SSD's data refresh.
An SSD refreshes its data periodically to maintain reliability, using a data-scrubbing process. These internal reads and writes make the drive's I/O latencies unpredictable, and the writes dominate the I/O outliers.
Two-step data refresh: (1) the device performs only the read part of data scrubbing and reports affected chunks through an NVMe AER (chunk notification entry); (2) the host manages the data movement, refreshing the data only if necessary. This increases the predictability of the drive, and the host controls the refresh strategy: should it refresh at all? Is there a copy elsewhere?
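The host-side half of the two-step flow is essentially a policy decision per notified chunk. A hedged sketch (function and field names are hypothetical; the real trigger is the NVMe AER chunk notification mentioned above):

```python
# Sketch of host-managed refresh policy: the device's read scrub flags a
# chunk (step 1); the host decides whether a rewrite is warranted
# (step 2). All names here are hypothetical illustrations.

def handle_chunk_notification(chunk, host_state):
    # The host may skip the refresh entirely if the data is dead or a
    # copy exists elsewhere — decisions the device alone cannot make.
    if chunk not in host_state["live_chunks"]:
        return "skip"        # data no longer referenced by the host
    if chunk in host_state["replicated"]:
        return "skip"        # another drive holds a copy
    return "rewrite"         # host copies the data to a fresh chunk

state = {"live_chunks": {1, 2}, "replicated": {2}}
assert handle_chunk_notification(3, state) == "skip"     # dead data
assert handle_chunk_notification(2, state) == "skip"     # replica exists
assert handle_chunk_notification(1, state) == "rewrite"  # must refresh
```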
Host-assisted Wear-Leveling: enable the host to separate hot/cold data onto chunks depending on wear.
An SSD typically does not know the temperature of newly written data, and placing hot and cold data together increases write amplification — often 4–5x for SSDs with no optimizations.
Chunks have limited reset cycles (as NAND blocks have limited erase cycles), so cold data should be placed on chunks nearer end-of-life while younger chunks are used for hot data.
Approach: introduce a per-chunk relative wear-level indicator (WLI); the host knows its workload and places data with respect to the WLI. This reduces garbage collection and increases lifetime and I/O performance. [Diagram: hot, warm, and cold superblocks with WLI 0, 33%, and 90% receiving data across successive GC passes.]
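The placement policy amounts to matching data temperature against the per-chunk WLI when selecting a free chunk. A minimal sketch, using the WLI values from the slide's example; the policy itself is illustrative, not mandated by the specification:

```python
# Sketch of wear-aware placement: cold data goes to chunks near
# end-of-life (high WLI), hot data to young chunks (low WLI), so the
# remaining reset cycles are spent where rewrites actually happen.
# The selection policy is an illustrative assumption.

def pick_chunk(free_chunks, temperature):
    # free_chunks maps chunk id -> relative wear-level indicator (percent).
    if temperature == "cold":
        return max(free_chunks, key=free_chunks.get)  # most-worn free chunk
    return min(free_chunks, key=free_chunks.get)      # least-worn free chunk

free = {"X": 0, "Y": 33, "Z": 90}        # WLI values from the slide
assert pick_chunk(free, "cold") == "Z"   # cold data on the worn chunk
assert pick_chunk(free, "hot") == "X"    # hot data on the young chunk
```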
Interface Summary
Together, the concepts provide: I/O isolation through groups & parallel units; fine-grained data refresh managed by the host; reduced write amplification by enabling the host to place hot/cold data efficiently; DRAM & over-provisioning reduction through append-only chunks; and direct-to-media access that avoids expensive internal data movement.
Specification available at http://lightnvm.io
Eco-system: a large eco-system through zoned block devices and OCSSD.
Linux kernel NVMe device driver: detects OCSSDs, supports the 1.2 and 2.0 specifications, registers them with the LightNVM subsystem, and can register them as zoned block devices (patches available).
LightNVM subsystem: core functionality, target management, and a target interface (enumerate, get geometry, I/O interface, etc.).
pblk: a host-side FTL that maps an OCSSD to a regular block device.
User space: liblightnvm, libzbc, fio (with zoned-block-device support), and SPDK. On top run regular file systems (xfs) on the pblk block device, file systems with SMR support (f2fs, btrfs), and applications using liblightnvm directly.
Open-Source Software Contributions
Initial release of the subsystem in Linux kernel 4.4 (January 2016). User-space library (liblightnvm) support upstream in Linux kernel 4.11 (April 2017). pblk available in Linux kernel 4.12 (July 2017). Open-Channel SSD 2.0 specification released (January 2018), with support available from Linux kernel 4.17 (May 2018). SPDK support for OCSSD (June 2018). fio with zone support (August 2018).
Upcoming: OCSSD as a zoned block device (patches available), RAIL (XOR support for lower latency), and a 2.0a revision.
Tools and Libraries LightNVM: The Linux Open-Channel SSD Subsystem https://www.usenix.org/conference/fast17/technical-sessions/presentation/bjorling LightNVM http://lightnvm.io LightNVM Linux kernel Subsystem https://github.com/openchannelssd/linux liblightnvm https://github.com/openchannelssd/liblightnvm QEMU NVMe with Open-Channel SSD Support https://github.com/openchannelssd/qemu-nvme 21
Western Digital and the Western Digital logo are registered trademarks or trademarks of Western Digital Corporation or its affiliates in the US and/or other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. The NVMe word mark is a trademark of NVM Express, Inc. All other marks are the property of their respective owners.