Accessing NVM Locally and over RDMA Challenges and Opportunities

Similar documents
What You can Do with NVDIMMs. Rob Peglar President, Advanced Computation and Storage LLC

Windows Support for PM. Tom Talpey, Microsoft

Windows Support for PM. Tom Talpey, Microsoft

SNIA NVM Programming Model Workgroup Update. #OFADevWorkshop

Extending RDMA for Persistent Memory over Fabrics. Live Webcast October 25, 2018

Persistent Memory over Fabric (PMoF) Adding RDMA to Persistent Memory Pawel Szymanski Intel Corporation

Flash Memory Summit Persistent Memory - NVDIMMs

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

SNIA Developers Conference - Growth of the iscsi RDMA (iser) Ecosystem

Chelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING

APIs for Persistent Memory Programming

Capabilities and System Benefits Enabled by NVDIMM-N

Annual Update on Flash Memory for Non-Technologists

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications

PM Support in Linux and Windows. Dr. Stephen Bates, CTO, Eideticom Neal Christiansen, Principal Development Lead, Microsoft

The SNIA NVM Programming Model. #OFADevWorkshop

RoCE vs. iwarp A Great Storage Debate. Live Webcast August 22, :00 am PT

Remote Persistent Memory SNIA Nonvolatile Memory Programming TWG

THE IN-PLACE WORKING STORAGE TIER OPPORTUNITIES FOR SOFTWARE INNOVATORS KEN GIBSON, INTEL, DIRECTOR MEMORY SW ARCHITECTURE

REMOTE PERSISTENT MEMORY ACCESS WORKLOAD SCENARIOS AND RDMA SEMANTICS

Application Access to Persistent Memory The State of the Nation(s)!

Non-Volatile Memory Through Customized Key-Value Stores

Using NVDIMM under KVM. Applications of persistent memory in virtualization

SMB3 Extensions for Low Latency. Tom Talpey Microsoft May 12, 2016

Persistent Memory over Fabrics

Application Acceleration Beyond Flash Storage

NVMe SSDs with Persistent Memory Regions

Innovator, Disruptor or Laggard, Where will your storage applications live? Next generation storage

NVM Express Awakening a New Storage and Networking Titan Shaun Walsh G2M Research

Building the Most Efficient Machine Learning System

Impact on Application Development: SNIA NVM Programming Model in the Real World. Andy Rudoff pmem SW Architect, Intel

Persistent Memory, NVM Programming Model, and NVDIMMs. Presented at Storage Field Day June 15, 2017

NVMe over Universal RDMA Fabrics

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Gen-Z Memory-Driven Computing

Remote Access to Ultra-Low-Latency Storage Tom Talpey Microsoft

Toward a Memory-centric Architecture

VMware vsphere Virtualization of PMEM (PM) Richard A. Brunner, VMware

N V M e o v e r F a b r i c s -

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research

RDMA Requirements for High Availability in the NVM Programming Model

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research

Sharing High-Performance Devices Across Multiple Virtual Machines

A New Key-value Data Store For Heterogeneous Storage Architecture Intel APAC R&D Ltd.

Using persistent memory and RDMA for Ceph client write-back caching Scott Peterson, Senior Software Engineer Intel

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory

Design Considerations When Implementing NVM

Low-Overhead Flash Disaggregation via NVMe-over-Fabrics Vijay Balakrishnan Memory Solutions Lab. Samsung Semiconductor, Inc.

Accelerating Microsoft SQL Server Performance With NVDIMM-N on Dell EMC PowerEdge R740

Learn Your Alphabet - SRIOV, NPIV, RoCE, iwarp to Pump Up Virtual Infrastructure Performance

Enterprise Architectures The Pace Accelerates Camberley Bates Managing Partner & Analyst

Storage Protocol Offload for Virtualized Environments Session 301-F

Building the Most Efficient Machine Learning System

Hewlett Packard Enterprise HPE GEN10 PERSISTENT MEMORY PERFORMANCE THROUGH PERSISTENCE

Flavors of Memory supported by Linux, their use and benefit. Christoph Lameter, Ph.D,

Remote Persistent Memory With Nothing But Net Tom Talpey Microsoft

PARAVIRTUAL RDMA DEVICE

Strata: A Cross Media File System. Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson

Application Advantages of NVMe over Fabrics RDMA and Fibre Channel

iscsi or iser? Asgeir Eiriksson CTO Chelsio Communications Inc

Microsoft SMB Looking Forward. Tom Talpey Microsoft

3D Xpoint Status and Forecast 2017

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

The Slow NVM (R)evolution in the context of Next Generation Postprocessing

Universal RDMA: A Unique Approach for Heterogeneous Data-Center Acceleration

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters

Update on Windows Persistent Memory Support Neal Christiansen Microsoft

Architected for Performance. NVMe over Fabrics. September 20 th, Brandon Hoff, Broadcom.

The Future of High Performance Interconnects

Low-Overhead Flash Disaggregation via NVMe-over-Fabrics

SNIA s SSSI Solid State Storage Initiative. Jim Pappas Vice-Char, SNIA

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Persistent Memory Over Fabrics. Paul Grun, Cray Inc Stephen Bates, Eideticom Rob Davis, Mellanox Technologies

NVM Express (NVMe) & Solid State Storage: What s Next? David L. Black, Ph.D. Distinguished Engineer, Office of the CTO

Accelerating Real-Time Big Data. Breaking the limitations of captive NVMe storage

Overcoming System Memory Challenges with Persistent Memory and NVDIMM-P

Persistent Memory: The Value to HPC and the Challenges

Planning For Persistent Memory In The Data Center. Sarah Jelinek/Intel Corporation

SAY-Go: Towards Transparent and Seamless Storage-As-You-Go with Persistent Memory

DataON and Intel Select Hyper-Converged Infrastructure (HCI) Maximizes IOPS Performance for Windows Server Software-Defined Storage

Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing

Introducing NVDIMM-X: Designed to be the World s Fastest NAND-Based SSD Architecture and a Platform for the Next Generation of New Media SSDs

HPE ProLiant Gen10. Franz Weberberger Presales Consultant Server

SOFTWARE-DEFINED BLOCK STORAGE FOR HYPERSCALE APPLICATIONS

All-NVMe Performance Deep Dive Into Ceph + Sneak Preview of QLC + NVMe Ceph

IN-PERSISTENT-MEMORY COMPUTING WITH JAVA ERIC KACZMAREK INTEL CORPORATION

Memory Management Strategies for Data Serving with RDMA

Programmable Solutions for Data Center Applications

New types of Memory, their support in Linux and how to use them with RDMA

NVMe Over Fabrics (NVMe-oF)

2017 Storage Developer Conference. Mellanox Technologies. All Rights Reserved.

New Computational Approaches to Big Data

RoCE Accelerates Data Center Performance, Cost Efficiency, and Scalability

IO virtualization. Michael Kagan Mellanox Technologies

Paving the Way to the Non-Volatile Memory Frontier. PRESENTATION TITLE GOES HERE Doug Voigt HP

Persistent Memory. High Speed and Low Latency. White Paper M-WP006

Failure-atomic Synchronization-free Regions

CS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es

WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems

THE STORAGE PERFORMANCE DEVELOPMENT KIT AND NVME-OF

Transcription:

Accessing NVM Locally and over RDMA Challenges and Opportunities Wendy Elsasser Megan Grodowitz William Wang MSST - May 2018

Emerging NVM A wide variety of technologies with varied characteristics Address granularity Endurance Cost per bit Write latency Density Read latency DRAM STT-MRAM PCM ReRAM NAND Variable latency and tail distributions 2

Multiple system use-cases with unique challenges Faster Storage 1000x faster than NAND Denser Mem 10x denser than DRAM Persistent Mem Non-Volatile NVM DRAM NVM SSD DRAM NVM DRAM 3 Storage Filesystem bottlenecks Transformative Capacity/TCOadvantage Endurance Bandwidth Caching Persistency Ordering Point of Persistence

What about persistence? Managing ordering requirements Core- 1 Core- 2 Core- 3 Core- 4 L1 $ L1 $ L1 $ L1 $ DRAM LLC Persistent Memory (PM) Recovery Recovery can inspect the data-structures in PM to restore system to a consistent state Crash consistency (failure atomicity) is needed to ensure recovery can restore system to a consistent state Data move through volatile memories before they get written to PM Using CPU cache flushes and fence instructions Direct connect PMEM protocols (NVDIMM) include explicit FLUSH semantics 4

Example: Add a node to a linked list with PMEM root headp 3 Publish & Persist 3 2 Node Initialize & Persist newnode nextp 1 PM Allocate 5

Persistent Memory Programming Models Native Persistence Library Persistence Atomic Library Persistence Durable TXs pt->x = 1; pt->y = 1; dccvap(&pt->x) dccvap(&pt->y) dsb flag=1; dccvap(&flag) dsb pt->x = 1; pt->y = 1; pmem_persist(&pt, sizeof(pt)) flag = 1; pmem_persist(&flag, sizeof(flag)) TX_BEGIN{ pt->x = 1; pt->y = 1; } TX_END createpersistundolog (L) mutatedata (M) persistdata (P) commitlog (C) 6 Programming simpler, overhead higher

PMDK (Persistent Memory Development Kit) Formally NVML, pmem libraries PMDK provides transactional APIs for persistent memory programming libpmemobj transactional APIs Use fine-grained logging and cache flushes Works on 64-bit Linux, Windows and 64-bit FreeBSD Ref: pmem.io 7

Flushing, logging and fencing overheads i7-6600u PMDK-v1.3 Normalized throughput 1.0 0.5 71% 68% 83% 72% 63% 96% 98% 95% 81% 37% 39% Moving NVM from storage to local, byte addressable memory greatly improves performance But... overheads still exist to maintain a point of persistency. Can be minimized with: Architectural optimizations Software optimizations Hardware acceleration 0.0 Log on Flush on Fence on All on map_insert map_remove Redis_SET Baseline: PMDK without flushing/fencing and logging on 8 Workloads: Map insert/remove, Redis Set. Implemented with NVML v1.3 libpmemobj transactions Platform: Intel i7-6600u with CLFLUSHOPT, single node with local DRAM

Fully incorporating NVM into your system Numerous attachment points for the varied use cases Gen-Z, Infiniband, RoCE PCIe, Etc. Emerging Emerging NVM Emerging NVM NVM NVM High capacity, scalable Local and remote / distributed NVM both of interest 9 SoC (Procesor) NVMe Storage NVM Fast storage, SSD caching Addressed as fast IO DRAM DIMM DDRx NVDIMM-P Low latency, moderately high capacity PMEM - Directly addressed NVM Large capacity and/or persistent memory New interfaces take advantage of byte addressable NVM How can we leverage RDMA for PMEM?

Remote Direct Memory Access 10 What? How? Why? When? Who / Where? Direct access to memory on a remote system without OS involvement Zero-copy networking; read/write from main memory with network adaptor Lower latency, higher bandwidth communication between distributed processes Late 90 s: Virtual Interface Architecture tried to standardize zero-copy networking Mid-late 00 s: First Infiniband implementations stable and mature. Today (2018): Still be described as a new technology Well, supercomputers, but also Chelsio Terminator 5 & 6 iwarp adapters GlusterFS internetwork filesystem Intel Xeon Scalable processors and Platform Controller Hub Mellanox ConnectX family of network adapters and InfiniBand switches Microsoft Windows Server (2012 and higher) via SMB Direct supports RDMA-capable network adapters, Hyper-V virtual switch and the Cognitive Toolkit. Apache Hadoop and Apache Spark big data analysis Baidu Paddle (PArallel Distributed Deep LEarning) platform Broadcom and Emulex adapters Caffe deep learning framework Cavium FastLinQ 45000/41000 Series Ethernet NICs Ceph object storage platform ChainerMN Python-based deep learning open source framework Nutanix's upcoming NX-9030 NVM Express flash appliance is said to support RDMA. Nvidia DGX-1 deep learning appliance Oracle Solaris 11 and higher for NFS over RDMA TensorFlow open source software library for machine intelligence Torch scientific computing framework VMware ESXi

RDMA programming Often abstracted underneath some other library layer MPI and other HPC communication libraries Lustre, NFS_RDMA and other I/O libraries SDP, rsockets, or other socket type interface Explicit programming of RDMA uses Verbs Verbs is not actually an API, but is instead a functional description of RDMA libibverbs is the standard Linux verbs implementation API APIs for verbs register byte array contiguous memory regions to make them available for remote access Same API for all RDMA enabled networks Infiniband RDMA Over Converged Ethernet (RoCE) Internet Wide Area RDMA Protocol (iwarp) NVM API s could leverage old ideas - E.g. Memory mapped files - Add a couple of more things like - Allocation, Flush - Great for adaption but must also ensure functionality and performance with new features and limitations 11

RDMA, PMEM, and filesystems current state Block device APIs already support concepts like flushing and persistence E.g. fflush() an IO stream means the data will be there after power outage Linux PMEM drivers are available for NVDIMM (byte addressable) support Fundamental NVM value - Data persistence Byte level access with DAX to bypass the page cache and get memory like speeds Three device modes for NVDIMM namespaces include: Fundamental PMEM value Memory mode: DAX byte level access + DMA support - Byte Addressable NVM But there is a small problem With direct PMEM access, pinned RDMA pages may be corrupted when the file is truncated Patch is available (*https://patchwork.kernel.org/patch/10028887/) 12

Where can we go from here? Emerging NVM is creating opportunities to redefine the memory sub-system Will still have slow, cheap storage, but will have fast, distributed PMEM in front of it FLUSH capability required for persistency across power-fail events Linux PMEM drivers currently available and NVDIMM-P natively supports FLUSH capabilities Optimizations possible to reduce overheads for persistency Must also ensure persistent capabilities work with RDMA Let s start with a bottom-up approach, leveraging existing technologies and developing new APIs Incorporate into distributed applications (work-flow model) to gain performance benefits Data sharing and synchronization in PMEM 13

Thank You! Danke! Merci!!! Gracias! Kiitos! 감사합니다 ध"यव द 14