Denali Open-Channel SSDs


Denali Open-Channel SSDs
Flash Memory Summit 2018, Architecture Track
Javier González

Open-Channel SSDs

Definition: a class of solid state drives that expose (some of) their geometry to the host and allow it to control (part of) their internals through a richer read/write/erase (R/W/E) interface.
- The design space is very large, ranging from pure physical access to restricted access rules.

Objective: move storage closer to the application by giving the host control over data placement and scheduling.
- Reduce Write Amplification Factor (WAF) and Space Amplification Factor (SAF).
- Make media parallelism available to the host; the device provides isolation across parallel units.

The term was introduced in Baidu's paper [1], which optimized their KV-store by accessing physical media directly. Different types of OCSSDs have been implemented in the industry, e.g., by Fusion-io, Violin Memory and others. Several specifications are available:
- Open-Channel SSD 1.2 and 2.0 [2] (supported in the Linux kernel through LightNVM since 4.4 and 4.17, respectively)
- Alibaba's Dual-Mode SSD [3]

[1] An efficient design and implementation of LSM-tree based key-value store on open-channel SSD (EuroSys '14), Peng Wang, Guangyu Sun, Song Jiang, Jian Ouyang, Shiding Lin, Chen Zhang, and Jason Cong
[2] lightnvm.io
[3] In Pursuit of Optimal Storage Performance: Hardware/Software Co-Design with Dual-Mode SSD (Alibaba white paper), Yu Du, Ping Zhou, Shu Li

Denali Open-Channel SSDs

An industry standard for a specific point in the Open-Channel SSD design space:
- Goal: minimize host and device changes across NAND generations to speed up adoption of new media.
- The device manages the physical media: retention, ECC, access policies, program patterns, etc.
- The host controls data placement and I/O scheduling.
- Media access is abstracted through access rules that align with zoned devices.
- A feedback loop enables device-to-host communication.

Several companies are involved in the Joint Development Forum (JDF), with representation from:
- NAND, controller and SSD vendors
- Cloud Service Providers (CSPs)
- Enterprise storage

The starting point is the Open-Channel 2.0 specification:
- Continue the work done from 1.2 to 2.0.
- Address issues and limitations in 2.0.
- Incorporate feedback from the JDF into an industry-wide standard.
- Pave the way for incorporation into the NVMe standard.

Design Space

[Figure: three points in the SSD design space, with host responsibilities increasing from left to right: Traditional SSD, Open-Channel SSD 2.0 / Denali, and Open-Channel SSD 1.2. The figure shows how data placement, I/O scheduling, media management, the L2P table, active NAND management and error handling shift from the SSD controller to the host.]

Traditional SSD (host sees an LBA range; the controller owns the L2P table, active NAND management and error handling):
- Mimics HDDs and maintains the block abstraction for adoption.
- HDDs are already moving towards zone-based block devices (SMR).

Open-Channel SSD 2.0 / Denali (host owns data placement and I/O scheduling; the device keeps media management, health management and read error handling):
- Simplifies the device by removing the mapping layer while the device maintains the media.
- Removes the block abstraction to let the host place and schedule.
- Host rules are similar to zoned devices.

Open-Channel SSD 1.2 (host owns data placement, I/O scheduling, media management, the L2P table, active NAND management and error handling):
- Very simple device.
- Ideal for fast prototyping.
- Productizes a single NAND / vendor SSD.

Geometry: Nomenclature
- Logical Block: minimum addressable unit for reads.
- Chunk: collection of logical blocks; minimum addressable unit for resets (erases).
- Parallel Unit (PU): collection of chunks that must be accessed serially. A device typically has tens to hundreds of parallel units.
- Group: collection of PUs that share the same transfer bus on the device.

Tenant: a logical construction representing a collection of chunks used by the host for a given purpose. This abstraction lets the host manage WAF, SAF, isolation and parallelism (see the sketch below):
- A device can be partitioned into several workload-optimized tenants.
- When a tenant is composed of whole PUs, it is guaranteed I/O isolation: aggregated bandwidth does not impact per-tenant latencies.
- When no tenants share chunks, very little WAF is generated for applications that manage sequential writes and hot/cold data separation well.
- Per-tenant over-provisioning minimizes SAF, yielding better $ per TB.
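To make the nomenclature concrete, the sketch below models the geometry hierarchy as plain C structures. The type and field names (denali_geo, denali_tenant, etc.) are illustrative assumptions, not taken from the Open-Channel or Denali specifications; only the hierarchy (group > PU > chunk > logical block) and the tenant-as-chunk-collection idea come from the slide.

```c
/*
 * Illustrative sketch of the Open-Channel/Denali geometry hierarchy.
 * All names are hypothetical; only the hierarchy mirrors the slide:
 * a device is divided into groups, groups into parallel units (PUs),
 * PUs into chunks, and chunks into logical blocks.
 */
#include <stdint.h>
#include <stddef.h>

struct denali_geo {
    uint32_t num_groups;      /* groups: PUs sharing a transfer bus   */
    uint32_t pus_per_group;   /* PUs: must be accessed serially       */
    uint32_t chunks_per_pu;   /* chunks: reset (erase) granularity    */
    uint32_t lbas_per_chunk;  /* logical blocks: read granularity     */
    uint32_t lba_size;        /* bytes per logical block              */
};

/* A chunk address: (group, pu, chunk) identifies one reset unit. */
struct denali_chunk_addr {
    uint16_t group;
    uint16_t pu;
    uint32_t chunk;
};

/*
 * A tenant is a host-side collection of chunks used for one purpose.
 * Building a tenant from whole PUs gives I/O isolation; keeping chunks
 * unshared across tenants keeps WAF low.
 */
struct denali_tenant {
    const char               *name;       /* e.g., "kv-hot", "log-cold"   */
    struct denali_chunk_addr *chunks;     /* chunks owned by this tenant  */
    size_t                    num_chunks;
    uint32_t                  op_percent; /* per-tenant over-provisioning */
};
```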

Key Challenges*

* The Denali spec. is not finalized.

Device Warranty

Without any form of protection, the host can create hot spots and wear them out very quickly. Current warranty metrics do not apply, since the host manages data placement:
- DWPD and TBW are device-wide.
- Devices are optimized for a given workload: different over-provisioning, DWPD, etc.
- Current reliability / endurance standards are device-wide (e.g., JEDEC).

Path: the host is king, but it must behave!
- Reuse today's warranty concepts, but at a lower granularity (chunk).
- Add protection mechanisms against misbehaved hosts (e.g., prevent double erases*).
- Report wear unevenness (over a given threshold) to the host through the feedback loop* (see the sketch below).
- Maintain a dedicated log for warranty purposes as the vendor's ground truth.
- Incorporate Denali into current reliability / endurance standards.
- Create tools to verify host and device behavior.

* Present in Open-Channel 2.0
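Below is a minimal sketch of how a host might consume chunk-level wear-unevenness notifications from the device feedback loop. The event layout and handler name are assumptions made for illustration; the actual notification format is defined by the Open-Channel 2.0 / Denali feedback mechanism, not by this sketch.

```c
/*
 * Hypothetical host-side handler for chunk-level wear feedback.
 * The event layout and function names are illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

struct wear_event {
    uint16_t group;
    uint16_t pu;
    uint32_t chunk;
    uint32_t erase_count;     /* program/erase cycles on this chunk */
    uint32_t device_average;  /* device-wide average erase count    */
};

/*
 * React to a "wear unevenness over threshold" notification by steering
 * future allocations away from the hot chunk. The policy here is a
 * placeholder; a real host FTL would feed this into its allocator.
 */
static void on_wear_unevenness(const struct wear_event *ev)
{
    uint32_t delta = ev->erase_count - ev->device_average;

    fprintf(stderr,
            "wear warning: chunk (%u,%u,%u) is %u erases above average\n",
            ev->group, ev->pu, ev->chunk, delta);

    /* e.g., mark the chunk cold, or rotate the tenant's write pointer */
}
```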

Reservations

The host can organize parallel units (PUs) into different tenants to provide I/O isolation, but tenants are not visible to the device:
- The feedback loop can only report wear unevenness at chunk granularity.
- Uneven wear across tenants is expected.
- Device-side per-tenant manipulation is not possible (e.g., RAID).

Path: incorporate reservations (see the sketch below).
- Allow the host to create / delete tenants that are visible to the device; the host is aware of media boundaries: group, PU and chunk.
- Enable per-tenant feedback notifications, with different urgency and scope.
- Narrow down device responsibilities for different types of tenants.
- Support vertical, horizontal and hybrid configurations to manage isolation vs. capacity.
- NVMe alignment: Endurance Groups and NVM Sets.
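The sketch below imagines what a host-facing reservation interface could look like, with tenants made visible to the device and per-tenant feedback subscriptions. Every function and type name here is hypothetical; the Denali specification was not finalized at the time of the talk, so this is only one reading of the bullet points above.

```c
/*
 * Hypothetical reservation interface: tenants visible to the device.
 * None of these names come from the (unfinished) Denali specification.
 */
#include <stdint.h>
#include <stddef.h>

typedef uint32_t tenant_id_t;

/* Which PUs a tenant spans; the host knows group/PU/chunk boundaries. */
struct tenant_layout {
    uint16_t group;
    uint16_t first_pu;
    uint16_t num_pus;   /* whole PUs give the tenant I/O isolation */
};

/* Per-tenant feedback subscription: urgency and scope differ per tenant. */
enum feedback_scope   { FB_CHUNK, FB_TENANT };
enum feedback_urgency { FB_ADVISORY, FB_URGENT };

/* Create / delete tenants that the device can see and act on. */
int tenant_create(const struct tenant_layout *layout, size_t n,
                  tenant_id_t *out_id);
int tenant_delete(tenant_id_t id);

/* Ask the device to send wear/health notifications scoped to this tenant. */
int tenant_subscribe_feedback(tenant_id_t id,
                              enum feedback_scope scope,
                              enum feedback_urgency urgency);
```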

Device-Side RAID

RAID is strongly recommended by vendors for some NAND, independently of the ECC used:
- Prevent infant block death.
- Reach an acceptable P/E cycle count.

RAID has purposely been left out of the available specifications due to its complexity, and needs differ:
- CSPs deal with single-device errors at a higher level: erasure coding, software RAID, etc.
- Enterprise requires per-device guarantees: RAID within drives and across drives.

Path: support RAID as part of the reservation interface (see the sketch below).
- Denali should address both cloud and enterprise; leaving RAID undefined (i.e., vendor unique) will create fragmentation.
- RAID schemes should be reported by the device, for flexibility and controlled complexity.
- Enabled in Denali as part of reservations: create_tenant takes PUs and a RAID scheme as parameters.
- User-specific parity schemes can be accelerated elsewhere (see a few slides ahead).
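Building on the hypothetical reservation interface sketched earlier, this is one way the "create_tenant takes PUs and a RAID scheme" idea could look. The enum values and signature are assumptions; the slide only says that the device reports which schemes it supports and that tenant creation selects one.

```c
/*
 * Hypothetical extension of tenant creation with a device-reported
 * RAID scheme. Names and values are illustrative only; tenant_layout
 * and tenant_id_t are as in the previous sketch.
 */
#include <stdint.h>
#include <stddef.h>

typedef uint32_t tenant_id_t;
struct tenant_layout;   /* group, first_pu, num_pus (see above) */

enum raid_scheme {
    RAID_NONE   = 0,   /* host handles redundancy (e.g., erasure coding) */
    RAID_1      = 1,   /* mirroring across PUs within the tenant         */
    RAID_5_LIKE = 2,   /* parity rotated across the tenant's PUs         */
};

/* The device reports which schemes it can apply to a tenant. */
int device_query_raid_schemes(uint32_t *schemes_bitmap);

/* create_tenant: takes the PUs (layout) and the chosen RAID scheme. */
int create_tenant(const struct tenant_layout *layout, size_t n,
                  enum raid_scheme scheme, tenant_id_t *out_id);
```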

NVMe Alignment

The ultimate purpose is to standardize the core ideas of Denali OC in NVMe:
- It is necessary for wide industry adoption.
- It helps reduce fragmentation.
- It makes further integration into the storage stack easier.
- Changes are expected.

Architecture

Architecture: Host Management

[Figure: applications and their FTLs run on the host, together with Denali management; the host connects over PCIe + NVMe, or over a NIC + NVMe-oF fabric, to storage subsystems of Denali SSDs.]

The host manages storage entirely:
- Use of host CPU and memory.
- Need for dedicated drivers.*

Devices expose the Denali interface. Denali-OF is needed for disaggregation.

* Support in Linux through the LightNVM subsystem (from kernel 4.4).

Adoption Challenges & Insights

Service providers are not willing to dedicate extra host resources to managing storage:
- Resources are billable to the application.
- FTLs are expensive in terms of both CPU and memory.
→ Denali must enable both (i) a lightweight host and (ii) FTL offload.

Most applications can excel with simpler abstractions than raw media access:
- The error path is the critical path; raw media does fail.
→ Denali already abstracts the media. More complex constructions can be built on top.

There are considerable gains in moving computation and transfers closer to the data while maintaining a simple interface at the host (i.e., computational storage):
- RAID, erasure coding, deduplication, compression, etc.
- Analytics, calculations, DB operations, etc.
- Peer-to-peer PCIe, RDMA.
→ Denali is a flexible storage offload for different classes of applications.

Disaggregation must be simple:
- Management plus a transparent I/O path.
→ Denali can build on top of the work done on NVMe-oF.

Architecture: Offloaded Management

[Figure: applications on the host use block device, KV and AOs interfaces over PCIe + NVMe; a storage/networking accelerator runs the block device, KV and AOs FTLs plus Denali management, and attaches to Denali SSDs over PCIe + NVMe.]

The host uses a lightweight API:
- Traditional block devices
- Key-value stores
- Dedicated APIs, e.g., append-only streams (AOs)

The aggregator manages storage:
- Billable CPU and memory are not shared with the application.
- No need for Denali OC support in the host's OS.
- Allows computation to be offloaded closer to the data, on parts owned by the service provider.
- Vendor Unique commands allow NVMe to be used as the protocol.

PCIe-attached accelerators allow fast adoption of dedicated storage APIs on local storage; fabrics-attached accelerators enable typical NVMe-oF disaggregation (next slide).

Architecture: Offloaded Management over Fabrics

[Figure: multiple hosts run applications against block device, KV and AOs interfaces; each host connects through an NVMe-oF initiator NIC over the fabric to an NVMe-oF target NIC on a storage/networking accelerator, which runs the block device, KV and AOs FTLs plus Denali management and attaches to the SSDs over PCIe + NVMe.]

Back to Software APIs

[Figure: the offloaded-management-over-fabrics architecture from the previous slide, repeated as a sidebar.]

Storage protocols grow in complexity by adding vendor- and user-specific extensions:
- They solve a problem under a set of assumptions.
- They require dedicated support from SSD vendors.
- Once standardized, they must be carried forever.
- Once deployed, they cannot easily be modified.

Denali OC allows these extensions to become APIs:
- Easy to prototype, test, deploy and modify.
- Much shorter deployment time (years to weeks).
- Still allows for multiple sources.
- NVMe can be used as the protocol through VU commands (which remain between host and accelerator).
- Storage accelerators simplify this further by encapsulating Denali OC-specific management, leaving the host untouched: offloads, HW acceleration, user-specific secret sauce, etc.

Denali is not the ultimate spec of specs, but it can help reduce workload-specific extensions in NVMe.

Case Study: Append-Only Streams (Demo)

[Figure: AOs applications on the host reach, through an NVMe-oF initiator and the fabric, a storage/networking accelerator running libaos on top of liblightnvm and Denali management, attached to the SSDs over PCIe + NVMe.]

Append-Only Streams (AOs):
- A spec developed by Microsoft to (i) minimize WAF, (ii) reduce device DRAM and (iii) speed up adoption of denser NAND.
- Extends the concept of directives (i.e., I/O streams): write sequentially following the rules, read randomly; minimize data movement during GC; management interface with per-stream properties (e.g., isolation).

libaos (see the sketch below):
- Built entirely on top of liblightnvm (< 1500 LOC).
- Runs transparently on OCSSD 1.2, 2.0 and Denali.

Accelerator:
- Hides libaos implementation details.
- Does not require SSDs to support AOs natively.
- The AOs spec can change and libaos can be improved through a software update: no need to replace any piece of HW, avoiding new SKUs and drive re-qualification.
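To illustrate how an append-only stream library like libaos can sit on top of an open-channel user-space library, here is a minimal, hypothetical sketch. The aos_* names and the mapping of one stream onto a set of chunks are assumptions made for this example; libaos itself is not shown in this deck, and the liblightnvm calls it actually uses are not reproduced here.

```c
/*
 * Hypothetical append-only stream API in the spirit of libaos.
 * All aos_* names are invented for illustration; the real libaos
 * is built on liblightnvm and is not reproduced here.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

struct aos_stream {
    uint32_t stream_id;   /* per-stream properties, e.g., isolation   */
    uint64_t write_ptr;   /* streams are written strictly append-only */
    uint64_t capacity;    /* bytes backed by the stream's chunks      */
};

/* Open a stream; a real implementation would reserve chunks
 * (or whole PUs for isolation) through liblightnvm. */
struct aos_stream *aos_open(uint32_t stream_id, uint64_t capacity)
{
    struct aos_stream *s = calloc(1, sizeof(*s));
    if (s) {
        s->stream_id = stream_id;
        s->capacity  = capacity;
    }
    return s;
}

/* Append: writes are sequential by construction; reads may be random. */
int aos_append(struct aos_stream *s, const void *buf, size_t len,
               uint64_t *out_offset)
{
    if (s->write_ptr + len > s->capacity)
        return -1;          /* stream full: caller opens a new one */

    /* ...issue sequential writes to the stream's chunks here... */
    (void)buf;

    *out_offset = s->write_ptr;
    s->write_ptr += len;
    return 0;
}
```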

Conclusions

Different types of Open-Channel SSDs are gaining significant traction:
- Increasing demand from both CSPs and enterprise.
- Different flavors available.
- Different vendors with SDKs and prototypes.
→ The Denali JDF is a step towards reducing fragmentation; NVMe is the goal.

Accelerator platforms for storage offload are a good fit for Denali OC adoption:
- They do not spend host resources managing storage.
- They facilitate offloads and provide them with a richer storage interface.
- They leverage the existing ecosystem.
- They help keep the NVMe protocol lightweight and promote the use of software APIs.

A growing ecosystem with more companies contributing:
- CNEX Labs, Western Digital, Intel, Red Hat and others.
- lightnvm.io: documentation, research articles, GitHub repositories and examples.

CNEX Labs, Inc.

Visit us at the CNEX Labs room. Request a demo: info@cnexlabs.com

Denali:
- Denali-OF tenant isolation with Broadcom's Stingray
- Denali-OF tenant isolation with Mellanox's BlueField
- Append-only streams library (libaos) on Denali using liblightnvm