Remote Access to Ultra-Low-Latency Storage
Tom Talpey, Microsoft


Outline
- Problem Statement
- RDMA Storage Protocols Today
- Sources of Latency
- RDMA Storage Protocols Extended
- Other Protocols Needed

Related SDC 2015 Talks
- Monday: Neal Christiansen
- Tuesday: Jim Pinkerton, Andy Rudoff, Doug Voigt
- Wednesday: Chet Douglas
- Thursday: Paul von Behren

Problem Statement

RDMA-Aware Storage Protocols
- Focus of this talk: enterprise / private-cloud-capable storage protocols
  - Scalable, manageable, broadly deployed
  - SMB3 with SMB Direct, NFS/RDMA, iSER
- Many others exist, including NVM Fabrics, but they are not the focus here

New Storage Technologies Emerging
- Advanced block devices
  - I/O bus attached: PCIe, SSD, NVMe, ...
  - Block, or in the future byte, addressable
- Storage Class Memory ("PM", Persistent Memory)
  - Memory bus attached: NVDIMM
  - Block or byte accessible
  - Emerging persistent memory technologies: 3D XPoint, PCM, ...
  - In various form factors

Storage Latencies Decreasing
- Write latencies of storage protocols (e.g. SMB3) today are down to 30-50 µs on RDMA
  - A good match to HDD/SSD; a stretch match to NVMe; PM, not so much
- Storage workloads are traditionally highly parallel, so latencies are mitigated
- But workloads are changing:
  - Write replication adds a latency hop
  - Write latency is critical to reduce

  Technology   Latency (high)              Latency (low)         IOPS
  HDD          10 msec                     1 msec                100
  SSD          1 msec                      100 µsec              100K
  NVMe         100 µsec                    10 µsec (or better)   500K+
  PM           < 1 µsec (~ memory speed)                         BW/size (>>1M/DIMM)

  (Latency: orders of magnitude decrease)

New Latency-Sensitive Workloads
- Writes!
  - Small, random: virtualization, enterprise applications
  - MUST be replicated and durable
  - A single write creates multiple network writes
- Reads
  - Small, random reads are latency sensitive
  - Large reads are more forgiving
  - But recovery/rebuild are interesting/important

Writes, Replication, Network
- Writes (with possible erasure coding) greatly multiply network I/O demand: the "2-hop" issue
- All such copies must be made durable before responding
- Therefore, latency is critical!
[Diagram: a single Write fans out to replicas / erasure-code fragments, each followed by a Commit]

APIs and Latency
- APIs also shift the latency requirement
- Traditional Block and File APIs are often parallel
- Memory-mapped and PM-aware APIs, not so much
  - Effectively a Load/Store expectation: memory latency, with a possibly expensive Commit
- Local caches can improve Read (load), but not Write (store / remotely durable)
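
Where the above matters in practice is the memory-mapped path. Below is a minimal local sketch of that load/store pattern, assuming a PM-backed file at a hypothetical path (/mnt/pmem/log): the store itself completes at memory latency, while durability still requires an explicit, potentially expensive commit step (shown here with the portable msync fallback).

```c
/* Sketch: memory-mapped, load/store-style access to a PM-backed file.
 * Hypothetical path; error handling trimmed for brevity. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem/log", O_RDWR);   /* hypothetical PM-backed file */
    size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    memcpy(p, "record", 6);       /* the "store": runs at memory latency */

    /* The expensive part is the commit: the data is not durable until it
     * is explicitly flushed (msync here; CLFLUSHOPT-style flushes on DAX). */
    msync(p, len, MS_SYNC);

    munmap(p, len);
    close(fd);
    return 0;
}
```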

RDMA Storage Protocols Today

Many Layers Are Involved
- Storage layers: SMB3 and SMB Direct; NFS, pNFS and NFS/RDMA; iSCSI and iSER
- RDMA layers: iWARP; RoCE, RoCEv2; InfiniBand
- I/O busses: storage (filesystem, block, e.g. SCSI, SATA, SAS, ...); PCIe; memory

SMB3 Architecture (shameless plug)
- Principal Windows remote file-sharing protocol
- Also an authenticated, secure, multichannel, RDMA-capable session layer
- Transport for:
  - File system operations (ReFS, NTFS, etc.)
  - Block operations (VHDX, RSVD, EBOD, ...)
  - Hyper-V Live Migration (VM memory)
  - RPC (Named Pipes)
- Future transport for:
  - Backend NVMe storage
  - Persistent Memory

SMB3 Components (example)
[Block diagram: on the client, applications (Win32 API), memory-mapped files, and Hyper-V Live Migration / guest memory feed the SMB3 redirector (RDR), which connects over TCP and RDMA (Multichannel) to the server's SMB3 server (SRV); SRV fronts storage providers (SCSI; VHD, RSVD, EBOD; filesystem) and storage backends (HDD/SSD, NVMe, block-mode PM, raw-mode PM), with a possible "PM Direct" path.]

Contributors to Latency

RDMA Transfers: Storage Protocols Today
- Direct placement model (simplified and optimized)
  - Client advertises an RDMA region in the scatter/gather list (Register)
  - Server performs all RDMA
    - WRITE: client Registers and Sends the request; server RDMA Reads the data (with local invalidate), then Sends the response (with invalidate)
    - READ: client Registers and Sends the request; server RDMA Writes the data, then Sends the response (with invalidate)
  - More secure: the client does not access the server's memory
  - More scalable: the server does not preallocate to the client
  - Faster for parallel (typical) storage workloads
- SMB3 uses this for READ and WRITE; NFS/RDMA and iSER are similar
- Server ensures durability
- Interrupts and CPU on both sides
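
As a rough illustration of the client side of this model, the libibverbs sketch below registers a buffer for a storage WRITE and advertises it in the request; the protection domain and queue pair are assumed to be already connected, and the request encoding (struct write_request) and helper are a hypothetical stand-in for the storage protocol's own format, not any of the protocols named above.

```c
/* Sketch: client advertises a buffer for a storage WRITE; the server
 * will RDMA Read the data from it. Assumes pd/qp are already set up. */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

struct write_request {            /* hypothetical wire encoding */
    uint64_t addr;
    uint32_t rkey;
    uint32_t length;
};

static int advertise_write_buffer(struct ibv_pd *pd, struct ibv_qp *qp,
                                  void *buf, size_t len)
{
    /* Register the data buffer so the server may RDMA Read it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr)
        return -1;

    struct write_request req = {
        .addr   = (uintptr_t)buf,
        .rkey   = mr->rkey,
        .length = (uint32_t)len,
    };

    /* Post the request as an RDMA Send; the server performs the RDMA Read,
     * and its response Send (with invalidate) revokes the registration. */
    struct ibv_sge sge = {
        .addr = (uintptr_t)&req, .length = sizeof(req), .lkey = 0 /* inline */
    };
    struct ibv_send_wr wr = {
        .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE,
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}
```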

Latencies
- Undesirable latency contributions:
  - Interrupts, work requests
  - Server request processing
  - Server-side RDMA handling, CPU processing time
  - Request processing
  - I/O stack processing and buffer management to traditional storage subsystems
  - Data copies
- Can we reduce or remove all of the above to PM?

RDMA Storage Protocols Extended

Push Mode (Schematic)
- Enhanced direct placement model
  - Client requests a server resource of a file, memory region, etc.: MAP_REMOTE_REGION(offset, length, mode r/w)
  - Server pins/registers/advertises an RDMA handle for the region
  - Client performs all RDMA
    - RDMA Write to the region ("Push mode"), RDMA Read from the region ("Pull mode")
    - No requests of the server (no server CPU/interrupt); achieves near-wire latencies
- Client remotely commits to PM (a new RDMA operation!)
  - Ideally, no server CPU interaction; the RDMA NIC optionally signals the server CPU
  - The operation completes at the client only when remote durability is guaranteed
- Client periodically updates the server via the master protocol (e.g. file change, timestamps, other metadata)
- Server can call back to the client to recall, revoke, manage resources, etc.
- Client signals the server (closes) when done
[Ladder diagram: Send / Register / Send to set up; RDMA Write data transfers ("Push") followed by RDMA Commit (new); RDMA Read data transfers ("Pull"); Send / Unregister / Send to tear down]
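
A rough client-side sketch of the push-mode data path follows, again in libibverbs style; the remote address and rkey are assumed to have come from the server's advertisement, and since no commit verb exists in today's API, the commit is shown as a hypothetical post_rdma_commit placeholder.

```c
/* Sketch: push-mode write plus a (hypothetical) remote commit.
 * remote_addr/rkey were advertised by the server; lmr registers the
 * local source buffer. Error handling trimmed. */
#include <infiniband/verbs.h>
#include <stdint.h>

/* Placeholder for the proposed RDMA Commit operation; no such verb
 * exists in the current API. */
int post_rdma_commit(struct ibv_qp *qp, uint32_t rkey);

static int push_write(struct ibv_qp *qp, struct ibv_mr *lmr,
                      void *src, uint32_t len,
                      uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)src, .length = len, .lkey = lmr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_RDMA_WRITE,          /* no server CPU involved */
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;

    struct ibv_send_wr *bad;
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    /* Completion of the write only means the data left this NIC; the
     * write is not durable at the server until the commit completes. */
    return post_rdma_commit(qp, rkey);
}
```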

Storage Protocol Extensions
- SMB3
- NFSv4.x
- iSER

SMB3 Push Mode (hypothetical)
- Setup: a new create context or FSCTL
  - Server registers and advertises the file, write/read, by handle
  - Or directly to a region of PM or an NVMe-style device!
  - Takes a Lease or lease-like ownership
- Write, Read: RDMA access by the client; the client writes and reads directly via RDMA
- Commit: client requests durability
  - Performed as a Commit via RDMA, with optional server processing
  - SMB_FLUSH-like processing for signaling if needed/desired
- Callback: server manages client access, similar to the current oplock/lease break
- Finish: client access complete; SMB_CLOSE, or lease manipulation
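
Purely as an illustration of what the setup step above might carry, here is a sketch of a create-context payload a client could attach to an SMB2 CREATE to request push-mode registration; the structure names and fields are invented for this sketch and are not part of any published SMB3 dialect.

```c
/* Hypothetical "push mode" create context payload (illustrative only;
 * not a defined SMB3 structure). Sent with SMB2 CREATE; the server's
 * reply would carry the advertised RDMA region. */
#include <stdint.h>

#define PUSH_MODE_ACCESS_READ   0x1
#define PUSH_MODE_ACCESS_WRITE  0x2

struct smb2_push_mode_request {     /* hypothetical */
    uint64_t offset;                /* region start within the file */
    uint64_t length;                /* region length in bytes */
    uint32_t access;                /* PUSH_MODE_ACCESS_* flags */
    uint32_t flags;                 /* e.g. request lease-like ownership */
};

struct smb2_push_mode_response {    /* hypothetical */
    uint64_t remote_addr;           /* server-advertised RDMA address */
    uint32_t rkey;                  /* RDMA handle (STag/rkey) */
    uint32_t commit_required;       /* nonzero if RDMA Commit is needed */
};
```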

NFS/RDMA Push Mode (hypothetical)
- Setup: a new NFSv4.x operation
  - Server registers and advertises the file or region, write/read, by filehandle
  - Offers a Delegation, or via a pNFS layout? (!)
- Write, Read: RDMA access by the client; the client writes and reads via RDMA
- Commit: client requests durability
  - Performed as a Commit via RDMA, with optional server processing
  - NFS4_COMMIT-like processing for signaling if needed/desired
- Callback: via the backchannel, similar to the current delegation or layout recall
- Finish: NFS4_CLOSE, or delegreturn or layoutreturn (if pNFS)

iSCSI (iSER) Push Mode (very hypothetical)
- Setup: a new iSER operation; the target registers and advertises write/read buffer(s)
- Write: a new "Unsolicited SCSI-In" operation
  - Implemented as an RDMA Write by the initiator to the target buffer
  - No target R2T processing
- Read: a new "Unsolicited SCSI-Out" operation
  - Implemented as an RDMA Read by the initiator from the target buffer
  - No target R2T processing
- Commit: a new iSER / modified iSCSI operation
  - Performed as a Commit via RDMA, with optional target processing
  - Leverage FUA processing for signaling if needed/desired
- Callback: a new SCSI Unit Attention???
- Finish: a new iSER operation

Other Protocols Extended

RDMA protocols
- Need a remote guarantee of durability
- RDMA Write alone is not sufficient for this semantic
  - Completion at the sender does not mean the data was placed
    - NOT that it was even sent on the wire, much less received
    - Some RNICs give stronger guarantees, but never that the data was stored remotely
  - Processing at the receiver means only that the data was accepted
    - NOT that it was sent on the bus
    - Segments can be reordered, by the wire or the bus
  - Only an RDMA completion at the receiver guarantees placement
    - And placement != commit/durable
- No Commit operation
  - Certain platform-specific guarantees can be made, but the client cannot know them
  - See Chet's presentation later today!

RDMA protocol extension
- Two obvious possibilities:
- 1. RDMA Write with placement acknowledgement
  - Advantage: simple API; set a "push" bit
  - Disadvantage: significantly changes the RDMA Write semantic and data path (flow control, buffering, completion)
    - Requires significant changes to RDMA Write hardware design
    - And also to the initiator work request model (flow-controlled RDMA Writes would block the send work queue)
  - Undesirable
- 2. RDMA Commit
  - A new operation, flow controlled/acknowledged like RDMA Read or Atomic
  - Disadvantage: a new operation
  - Advantage: simple API; a "flush" that operates on one or more STags/regions (allows batching) and preserves the existing RDMA Write semantic (minimizing RNIC implementation change)
  - Desirable

RDMA Commit (concept)
- RDMA Commit: a new wire operation
  - Implementable in iWARP and IB/RoCE
- Initiating RNIC provides a region list and other commit parameters
  - Under control of the local API at the client/initiator
- Receiving RNIC queues the operation to proceed in order
  - Like RDMA Read or Atomic processing currently
  - Subject to flow control and ordering
- RNIC pushes pending writes to the targeted regions
  - If not tracking regions, it flushes all writes
- RNIC performs the PM commit
  - Possibly interrupting the CPU on current architectures
  - In the future (highly desirable to avoid latency), performed via PCIe
- RNIC responds when durability is assured
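
To make the batching point concrete, here is a sketch of what a commit work request might carry at the verbs level; the structures and the posting call are hypothetical (no such verbs exist today) and are meant only to show a single commit naming multiple STags/regions while leaving the RDMA Write path untouched.

```c
/* Hypothetical work request for the proposed RDMA Commit operation
 * (not part of the current verbs API). A single commit names one or
 * more previously written regions, allowing batching. */
#include <stdint.h>

struct ibv_qp;                  /* from libibverbs */

struct rdma_commit_region {     /* hypothetical */
    uint32_t rkey;              /* STag / rkey of a target region */
    uint64_t offset;            /* start of the range to commit */
    uint64_t length;            /* length of the range */
};

struct rdma_commit_wr {         /* hypothetical */
    uint64_t wr_id;             /* completion identifier */
    struct rdma_commit_region *regions;
    uint32_t num_regions;
    uint32_t flags;             /* e.g. notify the server CPU after commit */
};

/* Hypothetical posting call: queued like RDMA Read/Atomic, completes
 * only when the receiving RNIC reports the data durable in PM. */
int post_rdma_commit_wr(struct ibv_qp *qp, const struct rdma_commit_wr *wr);
```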

Local RDMA API extensions (concept)
- New platform-specific attributes for RDMA registration
  - Processed at the server *only*; no client knowledge ensures future interop
- New local PM memory registration: Register(region[], PMType, mode)
  - PMType includes the type of PM, i.e. plain RAM, commit required, PCIe-resident, or any other local platform-specific processing
  - Mode includes the disposition of the data: read and/or write, cacheable after the operation
- The resulting handle, when sent by the peer in a Commit, is processed in the receiving NIC under control of the original Mode
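
A header-style sketch of the registration extension named above follows; the enums and the function are hypothetical, server-local API shapes intended only to show how a PM type and mode could be attached to a registration without the client ever seeing them.

```c
/* Hypothetical server-local registration API carrying PM attributes.
 * Not an existing verbs interface; illustrative only. */
#include <stddef.h>
#include <stdint.h>

struct ibv_pd;                       /* from libibverbs */

enum pm_type {                       /* hypothetical PMType values */
    PM_TYPE_PLAIN_RAM,               /* no commit needed */
    PM_TYPE_COMMIT_REQUIRED,         /* e.g. flush-on-commit NVDIMM */
    PM_TYPE_PCIE_RESIDENT,           /* PM behind a PCIe device */
};

enum pm_mode {                       /* hypothetical Mode flags */
    PM_MODE_READ      = 0x1,
    PM_MODE_WRITE     = 0x2,
    PM_MODE_CACHEABLE = 0x4,         /* cacheable after the operation */
};

struct pm_region {
    void   *addr;
    size_t  length;
};

/* Registers the regions with their PM attributes and returns an rkey
 * the server can advertise; Commits arriving for that rkey are handled
 * by the NIC according to the PMType and mode. */
int pm_register(struct ibv_pd *pd,
                const struct pm_region *regions, unsigned nregions,
                enum pm_type type, unsigned mode_flags,
                uint32_t *rkey_out);
```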

PCI Protocol Extension
- PCI extension to support Commit
  - To memory, CPU, PCI root, PM device, PCIe device, ...
- Avoids CPU interaction
- Supports a strong data-consistency model
- Performs the equivalent of:
  - CLFLUSHOPT (region list)
  - PCOMMIT
- (See Chet's presentation!)
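
For reference, the CPU-side sequence that the proposed PCIe-level commit would mirror looks roughly like the sketch below; it assumes an x86 build with CLFLUSHOPT support (e.g. compiled with -mclflushopt), and PCOMMIT is left as a comment rather than a call.

```c
/* Sketch: the CPU-side equivalent of the proposed PCIe-level commit,
 * for a region already written to persistent memory. x86-specific;
 * requires CLFLUSHOPT support (build with -mclflushopt). */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

static void commit_region(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    /* CLFLUSHOPT each cache line covering the region. */
    for (; p < end; p += CACHELINE)
        _mm_clflushopt((void *)p);

    /* Order the flushes before declaring the data committed. */
    _mm_sfence();

    /* On 2015-era platforms a PCOMMIT instruction would follow here to
     * push the flushed data to the persistence domain; it is omitted
     * since it is platform-specific (and was later withdrawn). */
}
```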

Expected Goal

Latencies
- Single-digit-microsecond remote Write+Commit (see Chet's presentation for estimate details)
  - Push mode: minimal write latencies (2-3 µs + data wire time)
  - Commit time is NIC-managed and platform+payload dependent
- Remote Read also possible: roughly the same latency as a write, but without the commit
- No server interrupt, once the RDMA and PCIe extensions are in place
- Single client interrupt; moderation and batching can reduce this when pipelining
- Deep parallelism with Multichannel and flow-control management

Open questions
- Getting to the right semantic?
  - Discussion in multiple standards groups (PCI, RDMA, Storage, ...)
  - How to coordinate these discussions?
  - Implementation in the hardware ecosystem
  - Drive consensus from upper layers down to lower layers!
- What about new API semantics?
  - Does NVML add new requirements?
  - What about PM-aware filesystems (DAX/DAS)?
- Other semantics, or are they upper-layer issues?
  - Authentication? Integrity/Encryption? Virtualization?

Discussion?