Accelerating Storage with RDMA Max Gurtovoy Mellanox Technologies

Transcription:

Accelerating Storage with RDMA
Max Gurtovoy, Mellanox Technologies
2018 Storage Developer Conference EMEA. Mellanox Technologies. All Rights Reserved.

What is RDMA?
Remote Direct Memory Access: provides the ability to perform a direct memory access (DMA) from one computer into another without involving either one's OS/CPU.
Created in 1999 (implementations: InfiniBand, RoCE, iWARP).
Main characteristics:
- High bandwidth
- Low latency
- Zero copy (CPU offload)
- Hardware-based data transfers
- Kernel bypass
- Direct access to HW for user-level applications
- QoS
- Asynchronous transactions

RDMA primitives
- QP (Queue Pair): send and receive queues, with various transport services, used for posting work requests to the HW:
  - RC (Reliable Connected) ~= TCP
  - UD (Unreliable Datagram) ~= UDP
  - UC (Unreliable Connected)
  - RD (Reliable Datagram): defined by the spec but not yet implemented
- CQ (Completion Queue): used for reporting work request completions to the host
- MR (Memory Region): describes a memory area, with the relevant permissions, accessible for RDMA from the device
- PD (Protection Domain): provides an association between QPs/MRs/MWs for enabling and controlling HCA access to host memory
Programming model: Verbs

RDMA operations
Messaging:
- RECV: post a buffer for incoming data
- SEND: send a buffer to a remote peer (who posted a RECV buffer for it in advance)
- REG_MR: memory registration for RDMA operations
One-sided:
- RDMA_WRITE: copy a local buffer (described by MR-L) to a remote buffer (MR-R)
- RDMA_READ: copy a remote buffer (described by MR-R) to a local buffer (MR-L)
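The difference between the two-sided and one-sided operations can be illustrated with a toy in-process model. This is only a sketch: `ToyQP`, `connect`, the permission strings, and the integer keys are hypothetical names invented for illustration, and nothing here is the verbs API. It shows that SEND consumes a RECV buffer the peer posted in advance and produces a completion on both sides, while RDMA_WRITE and RDMA_READ touch a registered region on the peer and complete only locally.

```python
from collections import deque


class ToyQP:
    """Toy model of one end of a reliable connection (not a real verbs API)."""

    def __init__(self):
        self.peer = None
        self.recv_queue = deque()   # buffers posted with RECV
        self.cq = deque()           # completions reported to the host
        self.mrs = {}               # key -> (bytearray, set of permissions)

    def reg_mr(self, key, buf, perms):
        # Register a buffer so the peer may access it with the given key.
        self.mrs[key] = (buf, set(perms))

    def post_recv(self, buf):
        self.recv_queue.append(buf)

    def post_send(self, data):
        # Two-sided: needs a RECV buffer that the peer posted in advance.
        if not self.peer.recv_queue:
            raise RuntimeError("peer has no RECV buffer posted")
        buf = self.peer.recv_queue.popleft()
        if len(data) > len(buf):
            raise ValueError("message larger than the posted RECV buffer")
        buf[:len(data)] = data
        self.cq.append(("SEND", len(data)))
        self.peer.cq.append(("RECV", len(data)))

    def _remote_mr(self, key, perm, offset, length):
        buf, perms = self.peer.mrs.get(key, (None, set()))
        if buf is None or perm not in perms or offset + length > len(buf):
            raise PermissionError("remote access error")
        return buf

    def rdma_write(self, data, rkey, offset=0):
        # One-sided: no RECV is consumed and the peer sees no completion.
        buf = self._remote_mr(rkey, "remote_write", offset, len(data))
        buf[offset:offset + len(data)] = data
        self.cq.append(("RDMA_WRITE", len(data)))

    def rdma_read(self, rkey, offset, length):
        # One-sided: copies from the peer's registered region to the caller.
        buf = self._remote_mr(rkey, "remote_read", offset, length)
        self.cq.append(("RDMA_READ", length))
        return bytes(buf[offset:offset + length])


def connect(a, b):
    """Pair two toy QPs so that each one's peer is the other."""
    a.peer, b.peer = b, a
```

In this model a write against a key that was registered without the `remote_write` permission fails, which mirrors the role of the MR permissions listed under the RDMA primitives.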

Memory registration
Why do we need to register memory?
- Avoid data corruption
- Protect from unauthorized access
- Map the addresses to DMA language (PCI space)

Use Fast Memory Registration
- Memory registration is a heavy operation (allocations, pinning, translation, FW commands, ...)
- In the kernel (iSER/SRP/NVMe-oF, ...) we always receive the buffer from the user:
  - The user allocates a buffer
  - The user opens a file (block device or file system)
  - The user calls the read/write(buffer) syscall -> the ULP sees this as a bio or as an SG list
  - Pinning the buffer was done by the block layer (no need to take care of data corruption)
- One should use a special work request (WR) to make it fast:
  - Use a pre-allocated MR
  - Only DMA-map the SG list and update the HW memory management tables
  - Use an ib_sge object that represents a virtually contiguous buffer with a (key, address, length) tuple
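The idea of fast registration (take an MR from a pre-allocated pool, fill in its page list from an already DMA-mapped SG list, and hand back a single (key, address, length) descriptor) can be sketched as a small model. Everything below is an assumption made for illustration: `FastRegPool`, `sg_to_page_list`, and the `Sge` tuple are hypothetical names, the 4 KiB page size is assumed, and the key refresh in the low byte only imitates the general idea of changing the key on each reuse. It is not the kernel API.

```python
from collections import namedtuple

PAGE_SIZE = 4096  # assumed page size for the sketch
Sge = namedtuple("Sge", "key addr length")


def sg_to_page_list(sg_list, page_size=PAGE_SIZE):
    """Turn (dma_addr, length) entries into the page addresses an MR would hold.

    A plain page-list MR can only describe a buffer whose interior
    boundaries fall on page boundaries, so anything else is rejected.
    """
    pages = []
    last = len(sg_list) - 1
    for i, (addr, length) in enumerate(sg_list):
        if i > 0 and addr % page_size:
            raise ValueError("entry %d does not start on a page boundary" % i)
        if i < last and (addr + length) % page_size:
            raise ValueError("entry %d does not end on a page boundary" % i)
        first_page = addr - addr % page_size
        pages.extend(range(first_page, addr + length, page_size))
    return pages


class FastRegPool:
    """Pool of MRs allocated once, up front, and reused for every I/O."""

    def __init__(self, count, max_pages):
        self.max_pages = max_pages
        self.free = [{"key": (i + 1) << 8, "pages": []} for i in range(count)]

    def fast_register(self, sg_list):
        if not sg_list:
            raise ValueError("empty SG list")
        pages = sg_to_page_list(sg_list)
        if len(pages) > self.max_pages:
            raise ValueError("SG list needs more pages than the MR can hold")
        mr = self.free.pop()
        # Refresh the low byte of the key so a stale key cannot be reused.
        mr["key"] = (mr["key"] & ~0xFF) | ((mr["key"] + 1) & 0xFF)
        mr["pages"] = pages
        total = sum(length for _, length in sg_list)
        return mr, Sge(mr["key"], sg_list[0][0], total)

    def release(self, mr):
        mr["pages"] = []
        self.free.append(mr)
```

The returned `Sge` describes the whole scattered buffer as one virtually contiguous range starting at the first entry's address, which is what lets a single descriptor travel to the peer.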

Why Should We Care About RDMA?
Because faster storage needs a faster network (not only in HPC)!

Variety of RDMA Storage Protocols

Protocol Deep Dive: NVMe/NVMe-oF
- Share NVMe SSDs with multiple servers
  - Better utilization, capacity, rack space, power
  - Scalability, management
- NVMe over Fabrics standard
  - Version 1.0 completed in June 2016
  - High-performance access to remote SSD (not only SSD)
  - RDMA protocol is part of the standard (e.g. keyed SGLs)
  - Also FC and TCP (in progress)

NVMe-oF Exchange Model

NVMe and NVMe-oF/RDMA Fit Together Well

Example: NVMe-oF Protocol (Write)
Host:
- Register memory (get MR)
- Post SEND carrying a Command Capsule (CC) that contains the SQE (Submission Queue Entry) and a keyed SGL
Subsystem:
- Upon RCV:
  - Allocate memory for data
  - Post RDMA READ to fetch data
- Upon READ:
  - Post command to backing store
- Upon SSD completion:
  - Send NVMe-oF Response Capsule (RC)
  - Free memory
- Upon SEND:
  - Free CC and completion resources
[Sequence diagram labels, in order: Free send buffer; Free data buffer; NVMe Initiator: Post Send (CC); RNIC: Send Command Capsule, Ack, RDMA Read, Read response first, Read response last, Send Response Capsule, Ack; RNIC: Post Send (Read data), Post Send (RC); NVMe Target: Allocate memory for data, Register to the RNIC, Post NVMe command, Wait for completion, Free allocated memory, Free receive buffer, Free send buffer]
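The ordering of the write exchange can be traced with a short single-process sketch. It is a model under stated assumptions, not an implementation: `nvmeof_write`, the dictionary-based capsule layout, and the use of plain dictionaries for host memory and the backing store are all hypothetical. The point it illustrates is that the data does not travel with the command: the target pulls it from the host with an RDMA READ, using the key carried in the keyed SGL, before it touches the backing store.

```python
def nvmeof_write(host_memory, backing_store, lba, key):
    """Trace one keyed-SGL write; returns (response capsule, event trace).

    host_memory maps a registration key to the bytes registered under it.
    backing_store maps an LBA to the bytes stored there.
    """
    trace = []

    # Host side: the buffer is already registered under `key`.
    data_len = len(host_memory[key])
    trace.append("host: register memory (get MR)")
    capsule = {
        "sqe": {"opcode": "write", "cid": 1, "lba": lba},
        "sgl": {"key": key, "length": data_len},
    }
    trace.append("host: SEND command capsule")

    # Target side, upon RCV: allocate a staging buffer and fetch the data.
    staging = bytearray(capsule["sgl"]["length"])
    trace.append("target: allocate memory for data")
    staging[:] = host_memory[capsule["sgl"]["key"]]
    trace.append("target: RDMA READ from host")

    # Target side, upon READ: hand the command to the backing store.
    backing_store[capsule["sqe"]["lba"]] = bytes(staging)
    trace.append("target: post command to backing store")

    # Target side, upon SSD completion: respond and release the buffer.
    response = {"cid": capsule["sqe"]["cid"], "status": 0}
    trace.append("target: SEND response capsule")
    del staging
    trace.append("target: free memory")
    return response, trace
```

Calling `nvmeof_write({0x42: b"payload"}, {}, 10, 0x42)` yields a seven-step trace in which the RDMA READ precedes the backing-store command.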

Example: NVMe-oF Protocol (Read)
Host:
- Register memory (get MR)
- Post SEND carrying a Command Capsule (CC) that contains the SQE (Submission Queue Entry) and a keyed SGL
Subsystem:
- Upon RCV:
  - Allocate memory for data
  - Post command to backing store
- Upon SSD completion:
  - Post RDMA WRITE to write data back to host
  - Send NVMe-oF Response Capsule (RC)
- Upon SEND:
  - Free memory
  - Free CC and completion resources
[Sequence diagram labels, in order: Free send buffer; NVMe Initiator: Post Send (CC); RNIC: Send Command Capsule, Ack, Write first, Write last, Ack, Send Response Capsule, Ack; RNIC; NVMe Target: Post Send (Write data), Post Send (RC), Post NVMe command, Wait for completion, Free receive buffer, Free allocated buffer, Free send buffer]

Example: NVMe-oF Protocol (Write In-Capsule)
Host:
- Post SEND carrying a Command Capsule (CC) that contains the SQE (Submission Queue Entry) and data
- Useful for small IO (currently up to 4k)
Subsystem:
- Upon RCV:
  - Allocate memory for data
- Upon SSD completion:
  - Send NVMe-oF Response Capsule (RC)
  - Free memory
- Upon SEND:
  - Free RC and completion resources
[Sequence diagram labels, in order: Free send buffer; NVMe Initiator: Post Send (CC); RNIC: Send Command Capsule, Ack, Send Response Capsule, Ack; RNIC: Post Send (RC); NVMe Target: Post NVMe command, Wait for completion, Free receive buffer, Free send buffer]
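The choice between the in-capsule and keyed-SGL write paths comes down to a size check on the host. The sketch below assumes the 4k limit quoted on the slide; `build_write_capsule`, the `register` callback, and the dictionary layout are hypothetical and exist only to make the decision visible.

```python
IN_CAPSULE_MAX = 4096  # "currently up to 4k", taken from the slide


def build_write_capsule(sqe, data, register, in_capsule_max=IN_CAPSULE_MAX):
    """Build a write command capsule for `data`.

    Small payloads ride inside the capsule, so the target needs no RDMA READ
    and the host needs no memory registration. Larger payloads are registered
    through the caller-supplied `register` callback (hypothetical) and
    described by a keyed SGL instead.
    """
    if len(data) <= in_capsule_max:
        return {"sqe": sqe, "data": bytes(data)}
    key = register(data)
    return {"sqe": sqe, "sgl": {"key": key, "length": len(data)}}
```

With this shape, a 512-byte write never calls `register`, while an 8 KiB write does and carries only the key and length in the capsule.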

Challenges?!
- Performance: same as DAS
- Reduce memory footprint: share resources
- Scale: data is growing, so we must have an ultra-fast network
- Save $$$: build systems with cheaper CPU/HW
- Save CPU cycles: offload the data path to HW
- High availability: multipathing

NVMe-oF/RDMA has Great Performance!

Can we do better? Yes we can!
Currently WIP in Linux:
- Interrupt/completion moderation (AKA coalescing): a technique in which events that would normally trigger a HW interrupt are held back, either until a certain amount of work is pending or until a timeout timer triggers
- Register non-contiguous buffers using an indirect MR
  - The user can provide an iovec where each entry has its own length
  - We can't assume user buffers consist of full pages
  - We don't want the block layer to use bounce buffers: save CPU cycles
  - Use HW that supports indirection in the MM table
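Completion moderation as described above (hold events back until either enough work is pending or a timeout expires) can be modeled in a few lines. This is a sketch with assumed names: `CompletionModerator`, its parameters, and the explicit `now_us` timestamps are hypothetical, and time is passed in by the caller so the behavior is deterministic.

```python
class CompletionModerator:
    """Toy model of interrupt coalescing for a completion queue."""

    def __init__(self, max_count, timeout_us):
        self.max_count = max_count      # fire once this many events are pending
        self.timeout_us = timeout_us    # or once the oldest event is this old
        self.pending = 0
        self.first_ts = None
        self.interrupts = 0

    def on_completion(self, now_us):
        """Record one completion; return how many events were delivered."""
        if self.pending == 0:
            self.first_ts = now_us
        self.pending += 1
        return self._maybe_fire(now_us)

    def on_timer(self, now_us):
        """Periodic timer tick; return how many events were delivered."""
        return self._maybe_fire(now_us)

    def _maybe_fire(self, now_us):
        if self.pending == 0:
            return 0
        expired = now_us - self.first_ts >= self.timeout_us
        if self.pending >= self.max_count or expired:
            delivered = self.pending
            self.pending = 0
            self.first_ts = None
            self.interrupts += 1
            return delivered
        return 0
```

With `max_count=4` and `timeout_us=50`, four back-to-back completions produce one interrupt instead of four, and a lone completion is still delivered once the timeout has elapsed.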

- ConnectX-4 (and above) devices support indirection
- Implemented in iSER; SRP/NVMe-oF patches submitted
- Use IB_MR_TYPE_SG_GAPS
Please try it!
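Whether a buffer needs an indirect MR can be decided from its iovec alone. The sketch below is an illustration under assumptions: `entries_with_gaps` and `pick_mr_type` are hypothetical helpers, the 4 KiB page size is assumed, and the strings merely name the outcome (the two `IB_MR_TYPE_*` names are the Linux RDMA MR types; the fallback string is a label, not an API).

```python
def entries_with_gaps(iovec, page_size=4096):
    """Indexes of (addr, length) entries that a plain page-list MR cannot cover.

    Only the first entry may start mid-page and only the last may end mid-page.
    """
    last = len(iovec) - 1
    bad = []
    for i, (addr, length) in enumerate(iovec):
        starts_mid_page = i > 0 and addr % page_size != 0
        ends_mid_page = i < last and (addr + length) % page_size != 0
        if starts_mid_page or ends_mid_page:
            bad.append(i)
    return bad


def pick_mr_type(iovec, device_supports_sg_gaps):
    """Name the registration strategy for this iovec."""
    if not entries_with_gaps(iovec):
        return "IB_MR_TYPE_MEM_REG"
    if device_supports_sg_gaps:
        return "IB_MR_TYPE_SG_GAPS"
    return "bounce buffer"
```

An iovec of three 512-byte fragments on different pages is reported as having gaps, so it either needs a device with indirection support or falls back to copying.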

Reducing Memory Footprint by Using SRQs
- SRQ stands for Shared Receive Queue
- QPs/connections are cheap, receive buffers are not!
- Solution: share receive buffering resources between QPs
  - According to the parallelism required by the application
  - Locality of completions, scalability
- NVMe-oF implementation today uses 1 SRQ per HCA
  - Lock contention in the data path
  - No parallelism
  - Better to use an SRQ per core or per completion vector (MSI-X)
- We have submitted patches to fix performance in Linux. Please try!
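A back-of-the-envelope calculation shows why receive buffers dominate and how an SRQ per completion vector keeps both the footprint and the locality. The helper names and all numbers in the usage note (connection count, queue depths, buffer size, core count) are assumptions chosen for illustration, not measurements from the talk.

```python
def recv_bytes_per_qp(num_qps, depth, buf_size):
    """Receive memory when every QP owns its own receive queue."""
    return num_qps * depth * buf_size


def recv_bytes_shared(num_srqs, srq_depth, buf_size):
    """Receive memory when QPs draw buffers from shared receive queues."""
    return num_srqs * srq_depth * buf_size


def srq_for_qp(comp_vector, num_srqs):
    """Pick the SRQ for a QP from its completion vector.

    Mapping by completion vector keeps a QP's receive buffers on the same
    core that handles its completions, avoiding one global lock.
    """
    return comp_vector % num_srqs
```

For example, 1000 connections with 128 receive buffers of 4 KiB each need `recv_bytes_per_qp(1000, 128, 4096)` bytes, which is 500 MiB, while 16 per-core SRQs of 4096 buffers each need `recv_bytes_shared(16, 4096, 4096)` bytes, which is 256 MiB regardless of how many connections attach.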

Save CPU by Using NVMe-oF Target Offload
- NVMe-oF is built on top of RDMA
  - Transport communication in hardware
- NVMe-oF target offload enables the NVMe hosts to access the remote NVMe devices w/o any CPU processing
  - By offloading the entire NVMe-oF data path
  - Encap/decap NVMe-oF <-> NVMe is done by the adapter with 0% CPU
  - CPU is available for other applications
  - Easy configuration: echo 1 > .../subsystems/<subsys>/attr_offload
  - Admin operations are maintained in software
- IOPs with 0% CPU (512B IO read)
  - ConnectX-5: 1.0-1.2 MIOPs
  - BlueField SoC: 7.5 MIOPs
- Upstream submission TBD
  - Currently available in the MLNX_OFED package
  - Linux fork is available: https://github.com/mellanox/nvmeof-p2p/
- Save $$$: NVMe-oF target systems can use cheaper CPUs
[Diagram labels: Host Root Complex and Memory Subsystem; NVMe; IO; NVMe over Fabrics Target Offload; RDMA Transport; RNIC; Network; Admin]

NVMe-oF Target non-offload data path

NVMe-oF Target offload data path

RDMA block-based storage protocols in Linux
Protocols compared: NVMe-oF, iSER, SRP
- Fast memory registration: NVMe-oF V, iSER V, SRP V
- Indirect memory registration: NVMe-oF WIP, iSER V, SRP WIP
- SRQ: V, V
- SRQ per core: WIP
- Remote Mkey invalidation: V, V
- Block MQ: V, V
- RoCE support: NVMe-oF V, iSER V, SRP WIP
- User space tools: NVMe-oF nvmecli/nvmetcli, iSER iscsiadm/targetcli, SRP srp_daemon/targetcli
- High availability: NVMe-oF dm-multipath/nvme-multipath, iSER dm-multipath, SRP dm-multipath
- T10-PI: V
- User space open source target: SPDK, TGT

Thanks! maxg@mellanox.com