PERSISTENT MEMORY PROGRAMMING

14th ANNUAL WORKSHOP 2018
PERSISTENT MEMORY PROGRAMMING: THE REMOTE ACCESS PERSPECTIVE
Tom Talpey, Architect, Microsoft
April 10, 2018

OUTLINE
SNIA NVMP Programming Model: PMEM remote access considerations
New RDMA Semantics: extensions in support of PMEM at minimum latency
Windows PMEM Support: support for PMEM, and future possibilities

SNIA NVMP PROGRAMMING MODEL

PERSISTENT MEMORY VOLUME AND FILE
The model includes Block, File, Volume and Persistent Memory (PM) File modes, for use with memory-like NVM.
NVM.PM.VOLUME mode: a software abstraction presented to OS components for Persistent Memory (PM) hardware; it provides a list of physical address ranges for each PM volume and thin-provisioning management.
NVM.PM.FILE mode: describes the behavior for applications accessing persistent memory, including discovery and use of atomic write features, mapping PM files (or subsets of files) to virtual memory addresses, and syncing portions of PM files to the persistence domain.
[Figure: a user-space application uses the native file API of a PM-aware file system and load/store through MMU mappings to the PM device (NVM.PM.FILE mode); PM-aware kernel modules use an NVM PM-capable driver (NVM.PM.VOLUME mode).]
Memory mapping in NVM.PM.FILE mode enables direct access to persistent memory using CPU instructions.

MORE ON MAP AND SYNC
Map: associates memory addresses with an open file; the caller may request a specific address.
Sync: flushes the CPU cache for the indicated range.
Additional sync types: Optimized Flush (flush multiple ranges from user space); Async Flush / Drain (optimized flush with a later completion wait).
Warning: Sync does not guarantee order. Parts of the CPU cache may be flushed out of order, and this may occur before the sync action is taken by the application. Sync only guarantees that all data in the indicated range has been flushed at some time before the sync completes.
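
To make Map and Sync concrete, here is a minimal sketch of the POSIX analogue (mmap plus msync) that NVM.PM.FILE behavior maps onto on a PM-aware file system; the file path is hypothetical and error handling is abbreviated.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical file on a DAX-capable (PM-aware) file system */
        int fd = open("/mnt/pmem/example", O_RDWR | O_CREAT, 0644);
        if (fd < 0) return 1;
        ftruncate(fd, 4096);

        /* Map: associate memory addresses with the open file */
        char *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) { close(fd); return 1; }

        /* Store directly through the mapping */
        strcpy(addr, "hello, persistent memory");

        /* Sync: flush the indicated range to the persistence domain.
         * Note: the range is durable by the time msync returns, but the
         * order in which individual cache lines were flushed is not defined. */
        msync(addr, 4096, MS_SYNC);

        munmap(addr, 4096);
        close(fd);
        return 0;
    }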

REMOTE ACCESS FOR HA SOFTWARE MODEL
RDMA is used for high availability (HA) during msync or optimized flush.
[Figure: the application on Peer A operates in NVM.PM.FILE mode (native file API, optimized flush, load/store) via libc msync or libpmem opt_flush; the PM-aware file system and a network file system client in the kernel issue RDMA operation requests through the RNIC, which transfers the RDMA data to the PM device on Peer B.]

ASYNCHRONOUS FLUSH
Currently being defined in the SNIA NVMP TWG.
AsyncFlush() is an OptimizedFlush() that does not wait for the drain; the local x86 analogue is clflush/clflushopt.
A subsequent Drain() call is required; the local x86 analogue is sfence.
Anticipated use is overlapped (parallel) programming, and especially remote programming.

ASYNCHRONOUS FLUSH SEMANTICS
Async Flush defines ordering of durability, and nothing more: consistency (visibility) versus persistency (durability).
It is a relaxed-order memory API: Load/Store provide visibility, with relaxed ordering to memory; Flush/Drain provide durability, ordered only with respect to prior stores.
[Figure: loads and stores pass through the store buffer and cache hierarchy to volatile or persistent memory with relaxed ordering; Async Flush and Drain impose the durability ordering.]

LOCAL IMPLEMENTATION
Load => load; Store => store.
Async flush => pmem_flush (x86-64: clflush to the region(s) which are the target of the flush).
Drain => pmem_drain (x86-64: sfence to wait for the store pipeline(s) to complete).
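
A minimal local sketch using the PMDK libpmem calls named above (pmem_flush and pmem_drain); the file path is hypothetical, error handling is abbreviated, and the mapping is assumed to be real persistent memory.

    #include <libpmem.h>
    #include <string.h>

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;

        /* Map a (hypothetical) file on a DAX file system into memory */
        char *addr = pmem_map_file("/mnt/pmem/example", 4096,
                                   PMEM_FILE_CREATE, 0644,
                                   &mapped_len, &is_pmem);
        if (addr == NULL) return 1;

        /* Assumes real PM; a non-PM mapping would use pmem_msync instead */
        if (!is_pmem) { pmem_unmap(addr, mapped_len); return 1; }

        /* Store => ordinary CPU stores */
        strcpy(addr, "log record");

        /* Async flush => pmem_flush (clflush/clflushopt/clwb as available) */
        pmem_flush(addr, strlen(addr) + 1);

        /* ... overlapped work can be done here ... */

        /* Drain => pmem_drain (sfence) */
        pmem_drain();

        pmem_unmap(addr, mapped_len);
        return 0;
    }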

REMOTE IMPLEMENTATION
Load => load; Store => store.
Async flush => RDMA Write from the region(s) which are the target of the flush. Note: a page-fault model (at the time of load/store) is far too expensive and imprecise. This efficiently leverages the relaxed-ordering semantic to defer the RDMA Write; the NVMP interface facilitates this by providing the region list, and the async method allows the application to invoke it as an overlapped "giddy-up".
Drain => RDMA Flush: the RNIC issues an RDMA Flush to all queue pairs written to by the RDMA Write(s). Note: region or segment scope remains under discussion for the future. See the following section for the RDMA Flush definition.
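
As an illustration of the Store / Async flush / Drain mapping, a hedged libibverbs sketch follows; the queue pair, memory registration and remote address are assumed to already exist, and because RDMA Flush is only a proposed extension it appears here as a clearly hypothetical post_rdma_flush() placeholder, not a real verb.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Hypothetical placeholder for the proposed RDMA Flush extension;
     * no standard verb for it existed at the time of this talk. */
    int post_rdma_flush(struct ibv_qp *qp, uint64_t remote_addr,
                        uint32_t rkey, size_t len);

    /* Push a locally stored region to the remote PM and request durability.
     * qp, mr, remote_addr and rkey are assumed to be established already. */
    int remote_flush(struct ibv_qp *qp, struct ibv_mr *mr, void *local,
                     uint64_t remote_addr, uint32_t rkey, size_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };

        /* Async flush => RDMA Write covering the flushed region(s) */
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        struct ibv_send_wr *bad = NULL;
        if (ibv_post_send(qp, &wr, &bad))
            return -1;

        /* Drain => RDMA Flush (proposed: non-posted, ordered, acknowledged) */
        return post_rdma_flush(qp, remote_addr, rkey, len);
    }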

NEW RDMA SEMANTICS

RDMA PROTOCOLS
A remote guarantee of durability is needed, in support of OptimizedFlush()/AsyncFlush().
RDMA Write alone is not sufficient for this semantic: it guarantees remote visibility, but not final memory residency (durability), similar to the CPU cache => memory semantic, and it does not guarantee ordered delivery with respect to other operations (it is "post only").
An extension is desired: the proposed RDMA Commit, a.k.a. RDMA Flush. It executes as a non-posted operation that is ordered, flow controlled and acknowledged (like RDMA Read or Atomic); the initiator requests specific byte ranges to be made durable, and the responder acknowledges only when durability is complete. There is strong consensus on these basics.

RDMA FLUSH (CONCEPT)
A new wire operation and a new verb, implementable in iWARP and IB/RoCE.
The initiating RNIC provides the region and other commit parameters, under control of the local API at the client/initiator.
The receiving RNIC queues the operation to proceed in order, like non-posted RDMA Read or Atomic processing, subject to flow control and ordering.
The RNIC pushes pending writes to the targeted region(s); alternatively, the NIC may simply opt to push all writes.
The RNIC performs any necessary PM commit, possibly by interrupting the CPU in current architectures; in the future (highly desirable, to avoid latency) this would be performed via PCIe.
The RNIC responds when durability is assured.

STANDARDS DISCUSSION
There is broad IBTA agreement on the RDMA Flush approach, but discussion continues on semantic details: per-segment, per-region or per-connection? This has implications for the API (SNIA NVMP TWG), and it is important to converge.
Discussion is also ongoing on the non-posted write, which provides a write-after-flush semantic, useful for efficient log writes, transactions, etc., while avoiding pipeline bubbles.
PCI support for Flush from the RDMA adapter, to memory, CPU, PCI root, PM device, PCIe device, etc.: this avoids CPU interaction and supports a strong data persistency model. It is possibly achievable without full-blown PCIe extensions (hints, "magic" platform register(s), etc.), which could accelerate adoption with platform-specific support.

REMOTE ACCESS FOR HA LADDER DIAGRAM
[Ladder diagram, Remote Optimized Flush: an Optimized Flush (or Async Flush + Drain) issued by the application is carried from Peer A host software through the Peer A RNIC to the Peer B RNIC as a series of RDMA Writes followed by a Flush; the Peer B RNIC applies the writes and the flush to Peer B host software and persistent memory, and completion returns only after the flush is acknowledged.]

EXAMPLE: LOG WRITER (FILESYSTEM)
Forever: { write log record, Commit }, { write log pointer (, Commit) }. Latency is critical.
The log pointer cannot be placed until the log record is successfully made durable; the log pointer is the validity indicator for the log record. This is a transaction model: log records are eventually retired, and the buffer is circular.
Protocol implications: the writer must wait for the first commit (and possibly the second). The wait introduces a pipeline bubble, which is very bad for throughput and overall latency, so an ordering between the Commit and the second Write is desired to avoid it.
Proposed solution: a non-posted write, a special Write which executes like a Read and is ordered with other non-posted operations (especially Flush), with specific size and alignment restrictions. It is being discussed in the IBTA and is not yet expressed in the SNIA NVMP interface.
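
To show the ordering requirement in code, here is a minimal local analogue of the log-writer loop using libpmem; the record and pointer layout is hypothetical, and in the remote case each pmem_persist() would correspond to an RDMA Write plus Flush round trip, which is exactly the pipeline bubble the proposed non-posted write removes.

    #include <libpmem.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical circular-log layout, assumed to live in a pmem mapping:
     * fixed-size records plus a head pointer validating the newest record. */
    #define RECORD_SIZE 256
    #define NRECORDS    1024

    struct log {
        char     records[NRECORDS][RECORD_SIZE];
        uint64_t head;      /* validity indicator for the newest record */
    };

    void log_append(struct log *lg, const void *data, size_t len)
    {
        uint64_t slot = (lg->head + 1) % NRECORDS;

        /* 1. Write the log record and make it durable (first Commit). */
        memcpy(lg->records[slot], data, len < RECORD_SIZE ? len : RECORD_SIZE);
        pmem_persist(lg->records[slot], RECORD_SIZE);

        /* 2. Only now is it safe to publish the log pointer; publishing it
         * earlier could make a torn or absent record appear valid after a
         * crash. The second Commit makes the pointer itself durable. */
        lg->head = slot;
        pmem_persist(&lg->head, sizeof(lg->head));
    }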

WRITES, FLUSH AND WRITE-AFTER-FLUSH
[Ladder diagram: host software issues RDMA Writes through the NICs to the target host and PMEM (with a possible merge at the target), followed by a Flush and then a non-posted Write that is ordered behind the Flush.]

RDMA VERIFY
After an RDMA Write + RDMA Flush, how does the initiator know the data is intact? Or, in case of failure, which data is not intact? Reading the data back, or signaling the remote CPU, is inefficient and impractical.
The proposal is to add integrity hashes to a new operation, or possibly piggyback them on Flush (which would mean only one protocol change); note this is not unlike SCSI T10 DIF.
Hashing is implemented in the RNIC or in a library implementation, with the algorithm(s) negotiated by the upper layers. The implementation could live in the platform (e.g. the storage device itself), in RNIC hardware/firmware (e.g. the RNIC performs a readback and integrity computation), in other hardware on the target platform (e.g. chipset or memory controller), or in software (e.g. the target CPU).
This maps roughly to the SNIA NVMP TWG OptimizedFlushAndVerify().
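
A sketch of what the initiator-side check could look like, assuming (hypothetically) that the responder can return a hash over the flushed remote range; zlib's crc32 stands in for whatever algorithm the upper layers would actually negotiate.

    #include <zlib.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical: obtain the responder's hash over [remote_addr, +len),
     * e.g. piggybacked on the Flush response or a dedicated Verify op. */
    uint32_t remote_range_crc32(uint64_t remote_addr, size_t len);

    /* Returns 0 if the remote copy of the flushed range matches the local
     * source buffer, nonzero otherwise. */
    int verify_flushed_range(const void *local, size_t len, uint64_t remote_addr)
    {
        uint32_t local_crc  = (uint32_t)crc32(0L, (const Bytef *)local, (uInt)len);
        uint32_t remote_crc = remote_range_crc32(remote_addr, len);
        return local_crc == remote_crc ? 0 : -1;
    }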

WRITE, FLUSH AND VERIFY
[Ladder diagram: host software issues RDMA Writes through the NICs to the target host and PMEM (with a possible merge at the target), followed by a Flush and then a Verify.]

PRIVACY
Upper layers protect their send/receive messages today, but RDMA direct transfers are not protected; there is no RDMA encryption standard.
The desire is to protect user data with a user key, not a global, machine or connection key. That rules out IPsec, TLS and DTLS.
Why not just use the on-disk crypto? It is typically a block cipher, requiring block (not byte) access, and it provides no integrity, which would require double computation.
An authenticated stream cipher (e.g. AES-CCM/GCM as used by SMB3) provides wire privacy *and* integrity efficiently (it is not at-rest protection, so RDMA Verify is still needed), handles an arbitrary number of bytes per transfer, and shares the cipher and keying with the upper layer.
But how to plumb the key into RDMA NIC message processing? Enhance RDMA Memory Regions, which does not require an RDMA protocol change!
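
As a point of reference for the authenticated stream cipher mentioned above, this sketch seals a buffer of arbitrary length with AES-128-GCM using the OpenSSL EVP API; the 16-byte key and 12-byte nonce parameters are stand-ins for the per-region user key and per-message nonce the slide envisions plumbing into the NIC.

    #include <openssl/evp.h>

    /* Encrypts `len` bytes of `plain` into `cipher` and produces a 16-byte
     * authentication tag. Returns 0 on success. The caller supplies a
     * 16-byte key and a 12-byte nonce that must never repeat for this key. */
    int gcm_seal(const unsigned char key[16], const unsigned char nonce[12],
                 const unsigned char *plain, int len,
                 unsigned char *cipher, unsigned char tag[16])
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int outl = 0, finl = 0, ok;

        ok = ctx != NULL
          && EVP_EncryptInit_ex(ctx, EVP_aes_128_gcm(), NULL, key, nonce)
          && EVP_EncryptUpdate(ctx, cipher, &outl, plain, len)
          && EVP_EncryptFinal_ex(ctx, cipher + outl, &finl)
          && EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);

        EVP_CIPHER_CTX_free(ctx);
        return ok ? 0 : -1;
    }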

RDMA WRITE ENCRYPTION
Extend the Memory Register verb and the NIC TPT to include a key: Handle = Register(PD, buffer, mode, key).
Keys are held by the upper layer under user policy and passed down to the NIC; the NIC uses the key (plus a nonce) when reading or writing each region.
[Figure: host software on each peer passes the key to its NIC; data crosses the wire encrypted between the NICs (rendered as "Klingon" on the slide) and lands decrypted in the remote host PMEM.]

CIPHER HOUSEKEEPING
Authenticated ciphers typically employ nonces (e.g. AES-GCM); the same {key, nonce} pair is used at each end to encrypt/decrypt each message. Never reuse a nonce for a different payload!
The upper layer must coordinate nonce usage with the RDMA layer, and the RDMA layer must consider this when retrying. The NIC may derive the nonce sequence from the RDMA connection, e.g. from the RDMA MSN (not from the MR!); alternatively, the nonce can be prepended/appended to the data buffer. The upper layer consumes nonce space too, and re-keying is necessary when the nonce space is exhausted!
Nonces are large (SMB3 employs 11 bytes for GCM), but they require careful management and sharing with the ULP. Key management is an upper-layer responsibility, as it should be.
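
A small illustrative sketch, assuming a nonce derived from a per-connection salt plus a monotonically increasing message sequence number (MSN) as suggested above; the field sizes are hypothetical.

    #include <stdint.h>
    #include <string.h>

    /* Build a 12-byte GCM nonce as 4-byte connection salt || 8-byte MSN.
     * The same {key, nonce} must never be reused for a different payload,
     * so the caller must re-key before the counter space is exhausted. */
    void make_nonce(uint8_t nonce[12], uint32_t conn_salt, uint64_t msn)
    {
        memcpy(nonce, &conn_salt, sizeof(conn_salt));
        memcpy(nonce + sizeof(conn_salt), &msn, sizeof(msn));
    }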

RDMA QOS
Upper layers have no trouble saturating even 100Gb networks with RDMA today, and memory can sink writes at least this fast, so networks will rapidly congest.
Rate control, fairness and QoS are therefore required in the RDMA NIC: simplistically, bandwidth limiting, but a more sophisticated approach is desirable, such as classification and end-to-end QoS techniques (see the SDC2014 presentation on the Resources slide), Software-Defined Networking techniques, and Generic Flow Table-based policy. Existing support is present in many enterprise-class NICs.

WINDOWS PMEM SUPPORT

WINDOWS PERSISTENT MEMORY
Persistent memory is supported in Windows, and PMEM support is foundational in Windows going forward.
Support for JEDEC-defined NVDIMM-N devices is available from Windows Server 2016 and Windows 10 (Anniversary Update, Fall 2016).
PMDK is fully supported on Windows; code and binaries are available on GitHub (see the Resources slide).

WINDOWS PMEM ACCESS
DAX (Direct Access) mode: the app obtains direct access to PMEM via existing memory-mapping semantics; updates directly modify PMEM and the storage stack is not involved, giving true device performance (no software overhead). Byte-addressable, corresponding to NVM.PM.FILE. Available since Windows 10 Anniversary Update / Server 2016.
Block mode: the app continues to use traditional kernel-mediated APIs (file and block), giving full compatibility with existing apps and improved performance via the underlying hardware. Corresponds to NVM.PM.VOLUME. Available since Windows 10 Anniversary Update / Server 2016.
PMEM-aware code (later slide) can be open coded against the native Rtl API (Windows 10 Spring 2017 and a future Server release) or use the open-source PMDK.
[Figure: block-mode applications use the standard file and block APIs through the PM-aware file system (NTFS - DAX), disk driver and bus driver, which enumerates the NVDIMM; DirectAccess applications request a memory-mapped file via the DirectAccess setup path and then perform load/store operations on the memory-mapped region directly to PMEM over the DirectAccess data path.]

WINDOWS NATIVE PMEM APIS
Available from both user and kernel modes; they flush to durability in the optimal fashion for each hardware architecture. Supported in Windows 10 1703 (Spring 2017) and a future Windows Server release.
RtlGetNonVolatileToken: https://msdn.microsoft.com/en-us/library/windows/hardware/mt800701.aspx
RtlWriteNonVolatileMemory: https://msdn.microsoft.com/en-us/library/windows/hardware/mt800702.aspx
RtlFlushNonVolatileMemory / RtlFlushNonVolatileMemoryRanges: https://msdn.microsoft.com/en-us/library/windows/hardware/mt800698.aspx and https://msdn.microsoft.com/en-us/library/windows/hardware/mt800699.aspx
RtlDrainNonVolatileFlush: https://msdn.microsoft.com/en-us/library/windows/hardware/mt800697.aspx
RtlFreeNonVolatileToken: https://msdn.microsoft.com/en-us/library/windows/hardware/mt800700.aspx

USING DAX IN WINDOWS
DAX volume creation
DAX volume identification
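
One way to identify a DAX volume programmatically is via the FILE_DAX_VOLUME file-system flag reported by GetVolumeInformationW (Windows 10 1607 / Server 2016 and later); the sketch below checks a hypothetical D: volume. DAX volume creation itself is done with the format utility's /dax option.

    #include <windows.h>
    #include <stdio.h>

    /* Prints whether the given volume root (here a hypothetical D:\) is DAX. */
    int main(void)
    {
        DWORD fsFlags = 0;
        WCHAR fsName[MAX_PATH + 1];

        if (!GetVolumeInformationW(L"D:\\", NULL, 0, NULL, NULL,
                                   &fsFlags, fsName, MAX_PATH + 1)) {
            printf("GetVolumeInformationW failed: %lu\n", GetLastError());
            return 1;
        }

    #ifdef FILE_DAX_VOLUME
        printf("DAX volume: %s\n", (fsFlags & FILE_DAX_VOLUME) ? "yes" : "no");
    #else
        printf("This SDK predates FILE_DAX_VOLUME\n");
    #endif
        return 0;
    }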

IMPLICIT FLUSH CODE EXAMPLE
Memory mapping:
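
A minimal sketch, assuming a file on a hypothetical DAX-formatted NTFS volume, of memory mapping with the standard Win32 file-mapping APIs; on a DAX volume the stores modify PMEM directly, and durability here is left to the OS-mediated flush path (FlushViewOfFile / FlushFileBuffers) rather than the explicit Rtl flush calls shown on the next slides.

    #include <windows.h>
    #include <string.h>

    int main(void)
    {
        /* Hypothetical file on a DAX-formatted NTFS volume */
        HANDLE hFile = CreateFileW(L"D:\\pmem\\example.dat",
                                   GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                   OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile == INVALID_HANDLE_VALUE) return 1;

        HANDLE hMapping = CreateFileMappingW(hFile, NULL, PAGE_READWRITE,
                                             0, 4096, NULL);
        if (hMapping == NULL) { CloseHandle(hFile); return 1; }

        char *baseAddress = MapViewOfFile(hMapping, FILE_MAP_ALL_ACCESS, 0, 0, 4096);
        if (baseAddress == NULL) { CloseHandle(hMapping); CloseHandle(hFile); return 1; }

        /* On a DAX volume these stores modify PMEM directly */
        strcpy(baseAddress, "hello, DAX");

        /* Implicit (OS-mediated) flush of the mapped range */
        FlushViewOfFile(baseAddress, 4096);
        FlushFileBuffers(hFile);

        UnmapViewOfFile(baseAddress);
        CloseHandle(hMapping);
        CloseHandle(hFile);
        return 0;
    }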

EXPLICIT FLUSH CODE EXAMPLE

    PVOID token;
    RtlGetNonVolatileToken(baseAddress, size, &token);

    for (;;) {
        *(int *)baseAddress = rand();                                    // Write to PMEM
        RtlFlushNonVolatileMemory(token, baseAddress, sizeof(int), 0);   // flags = 0
        baseAddress = (char *)baseAddress + sizeof(int);
    }

    RtlFreeNonVolatileToken(token);
    CloseHandle(hMapping);

ASYNCHRONOUS FLUSH CODE EXAMPLE

    PVOID token;
    RtlGetNonVolatileToken(baseAddress, size, &token);

    for (;;) {
        *(int *)baseAddress = rand();                    // Write to PMEM
        RtlFlushNonVolatileMemory(token, baseAddress, sizeof(int),
                                  FLUSH_NV_MEMORY_IN_FLAG_NO_DRAIN);
        baseAddress = (char *)baseAddress + sizeof(int);
        // do more stuff
        RtlDrainNonVolatileFlush(token);                 // Wait for prior Flush
    }

    RtlFreeNonVolatileToken(token);
    CloseHandle(hMapping);

EXAMPLE: GOING REMOTE - SMB3
SMB3 RDMA and Push Mode.
The SMB3 server may implement DAX-mode direct mapping (path 2), reducing server overhead.
The RDMA NIC may implement the Flush extension (path 3), enabling zero-copy remote read/write/flush to a DAX file with ultra-low latency and overhead.
Paths 2 and 3 can be enabled before the RDMA extensions become available, at a slight extra cost.
[Figure: three data paths from the SMB3 client through the RDMA NIC to the DAX filesystem and PMEM on the server: (1) traditional I/O requests through the buffer cache, (2) DAX memcpy by the SMB3 server via the direct file mapping, (3) Push Mode / RDMA Commit direct from the RDMA NIC.]

RESOURCES
SNIA NVM Programming TWG: http://www.snia.org/forums/sssi/nvmp
NVMP 1.2: https://www.snia.org/sites/default/files/sdc/2017/presentations/solid_state_stor_nvm_pm_nvdimm/Talpey_Tom_Microsoft%20_SNIA_NVM_Programming_Model_V%201.2_and_Beyond.pdf
SNIA Storage Developers:
Remote Persistent Memory With Nothing But Net: https://www.snia.org/sites/default/files/sdc/2017/presentations/solid_state_stor_nvm_pm_nvdimm/Talpey_Tom_RemotePersistentMemory.pdf
Low Latency Remote Storage - A Full-stack View: https://www.snia.org/sites/default/files/sdc/2016/presentations/persistent_memory/tom_talpey_low_latency_remote_storage_a_full-stack_view.pdf
Storage Quality of Service for Enterprise Workloads: https://www.snia.org/sites/default/files/tomtalpey_storage_quality_service.pdf
Windows PMEM Programming:
https://msdn.microsoft.com/en-us/library/windows/hardware/ff553354.aspx
https://github.com/pmem/pmdk
https://github.com/pmem/pmdk/releases
Open Fabrics Workshop:
Remote Persistent Memory Access - Workload Scenarios and RDMA Semantics: https://www.openfabrics.org/images/eventpresos/2017presentations/405_remotepm_ttalpey.pdf

14th ANNUAL WORKSHOP 2018 THANK YOU