BUILDING A BLOCK STORAGE APPLICATION ON OFED - CHALLENGES


3RD ANNUAL STORAGE DEVELOPER CONFERENCE 2017
BUILDING A BLOCK STORAGE APPLICATION ON OFED - CHALLENGES
Subhojit Roy, Tej Parkash, Lokesh Arora, Storage Engineering
[May 26th, 2017]

AGENDA
- Introduction
- Setting the Context (SVC as Storage Virtualizer)
- SVC Software Architecture Overview
- iSER: Confluence of iSCSI and RDMA
- Performance: iSER vs. Fibre Channel
- Challenges
  - Queue Pair states
  - RDMA disconnect behavior
  - RDMA connection management
  - Large DMA memory allocation
  - Query device list
- Conclusion

INTRODUCTION

SETTING THE CONTEXT (SVC AS STORAGE VIRTUALIZER)
- SVC pools heterogeneous storage and virtualizes it for the host
- iSER target for hosts; iSER initiator for storage controllers (flash or HDD)
- SVC storage application clustered over iSER for high availability
- Supports both RoCE and iWARP
- Supports 10/25/40/50/100G bandwidths
[Diagram: hosts connect over the host SAN to clustered SVC nodes exposing VDisks; the SVC nodes connect over the device SAN to RAID controller LUNs]

SVC SOFTWARE ARCHITECTURE OVERVIEW
- SVC storage virtualization application (SCSI initiator and SCSI target) runs in user space
- iSER and iSCSI drivers in kernel space
- Lockless architecture (per-CPU port handling)
- Polled mode IO handling (see the sketch after this list)
- Supports RoCE and iWARP
- Vendor independent (Mellanox, Chelsio, QLogic, Broadcom, Intel etc.)
- Dependence on OFED kernel IB Verbs
[Diagram: SVC application on top of the kernel iSCSI driver and iSER initiator/target, each with its CQ/RQ/SQ, layered over OFED IB Verbs and RoCE/iWARP adapters]
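The polled-mode IO path can be illustrated with a minimal sketch. This is not the SVC implementation; it is an example assuming one completion queue per port, owned by a single CPU, and the function name and batch size are hypothetical. It uses the kernel verb ib_poll_cq() to drain completions without interrupts.

#include <rdma/ib_verbs.h>

#define EXAMPLE_POLL_BATCH 16   /* hypothetical batch size */

/* Minimal polled-mode completion loop: called repeatedly from the one CPU
 * that owns this port, so the data path needs no locks. */
static int example_poll_port(struct ib_cq *cq)
{
    struct ib_wc wc[EXAMPLE_POLL_BATCH];
    int n, i, total = 0;

    while ((n = ib_poll_cq(cq, EXAMPLE_POLL_BATCH, wc)) > 0) {
        for (i = 0; i < n; i++) {
            /* dispatch wc[i] into the iSER/iSCSI IO state machine */
        }
        total += n;
    }
    return total;   /* completions processed in this pass */
}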

iSER: CONFLUENCE OF iSCSI AND RDMA
- iSER is iSCSI with an RDMA data path
- Performance: low latency, low CPU utilization, high bandwidth
- High bandwidth: 25Gb, 50Gb, 100Gb and beyond
- No new administration! Leverages existing knowledge of iSCSI administration and its ecosystem on servers and storage

PERFORMANCE: iSER vs. FIBRE CHANNEL

CHALLENGES

QUEUE PAIR STATES
Goal
- Control the number of retries and the retry timeout during a network outage
Actual behavior
- State transitions differ across RoCE and iWARP, e.g. RoCE does not support the SQD state
Expectation
- Transition the QP to the SQD state in order to modify QP attributes (see the sketch after this list)
- ib_modify_qp() must transition QP states as per the QP state diagram (referenced from the book "Linux Kernel Networking: Implementation and Theory")
- All state transitions must be supported by both RoCE and iWARP
Work Around
- No workaround found; exploring vendor-specific possibilities
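A minimal sketch of the sequence this expectation describes, using standard kernel verbs. The function name and parameters are hypothetical; the first ib_modify_qp() call is the one that fails on providers that do not implement SQD (as observed here with RoCE).

#include <rdma/ib_verbs.h>

/* Attempt: quiesce the send queue (RTS -> SQD), change retry attributes,
 * then resume (SQD -> RTS). */
static int example_tune_qp_retries(struct ib_qp *qp, u8 timeout, u8 retry_cnt)
{
    struct ib_qp_attr attr = {};
    int ret;

    attr.qp_state = IB_QPS_SQD;
    ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
    if (ret)
        return ret;   /* fails where the provider does not support SQD */

    attr.qp_state = IB_QPS_RTS;
    attr.timeout = timeout;       /* local ACK timeout: 4.096us * 2^timeout */
    attr.retry_cnt = retry_cnt;   /* transport retries before reporting an error */
    return ib_modify_qp(qp, &attr,
                        IB_QP_STATE | IB_QP_TIMEOUT | IB_QP_RETRY_CNT);
}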

RDMA DISCONNECT BEHAVIOR
Goal/Observation
- A QP cannot be freed before the RDMA_CM_EVENT_DISCONNECTED event is received
- There is no control over the timeout period for this event (see the sketch after this list)
Actual behavior
- A link down on the peer system causes the DISCONNECT event to arrive only after a long delay (RoCE: ~100 sec, iWARP: ~70 sec)
- There is no standard mechanism (verb) to control these timeouts
Expectation
- The RDMA disconnect event must exhibit a uniform timeout across RoCE and iWARP
- The timeout period for disconnect must be configurable
Work Around
- Evaluating vendor-specific mechanisms to tune the CM timeout
[Diagram: SVC application connected across the fabric to a peer host/target]
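The dependency can be sketched as follows. The structure and function names are hypothetical and the 120-second bound is arbitrary; the point is that teardown must wait for RDMA_CM_EVENT_DISCONNECTED before destroying the QP, and no verb controls how long that event takes to arrive.

#include <rdma/rdma_cm.h>
#include <linux/completion.h>

struct example_conn {
    struct rdma_cm_id *cm_id;
    struct completion disconnected;
};

static int example_cm_handler(struct rdma_cm_id *id,
                              struct rdma_cm_event *event)
{
    struct example_conn *conn = id->context;

    if (event->event == RDMA_CM_EVENT_DISCONNECTED ||
        event->event == RDMA_CM_EVENT_TIMEWAIT_EXIT)
        complete(&conn->disconnected);
    return 0;
}

static void example_teardown(struct example_conn *conn)
{
    rdma_disconnect(conn->cm_id);
    /* The QP may only be freed once DISCONNECTED has arrived.  On a peer
     * link-down this wait was observed to take ~70s (iWARP) to ~100s (RoCE),
     * and no verb exposes a knob to shorten it. */
    wait_for_completion_timeout(&conn->disconnected, 120 * HZ);
    rdma_destroy_qp(conn->cm_id);
    rdma_destroy_id(conn->cm_id);
}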

RDMA CONNECTION MANAGEMENT
Goal
- Polled mode data path and connection management
Current mechanism
- No mechanism to poll for CM events; all RDMA CM events are interrupt driven
- Current implementation involves deferring CM events to Linux workqueues (see the sketch after this list)
- Application has no control over which CPU to poll CM events from
Expectation
- Queues for CM event handling that the application can poll
Work Around
- Defer CM events to workqueues; the locks this requires add to IO latency
[Diagram: same software stack as the architecture overview, with the CM event path highlighted]
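A minimal sketch of the deferral pattern described above (the example_* names are hypothetical): the interrupt-driven CM callback only records the event and hands it to a workqueue, so the application cannot choose the CPU that processes it and must protect shared state with locks the polled IO path would otherwise avoid.

#include <rdma/rdma_cm.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

struct example_cm_work {
    struct work_struct work;
    struct rdma_cm_id *cm_id;
    enum rdma_cm_event_type event;
};

static void example_cm_work_fn(struct work_struct *work)
{
    struct example_cm_work *w =
        container_of(work, struct example_cm_work, work);

    /* Runs on whichever CPU the workqueue picked, not on the CPU that
     * polls this port's IO: shared connection state needs locking here. */
    /* ... handle w->event for w->cm_id ... */
    kfree(w);
}

static int example_cm_handler(struct rdma_cm_id *id,
                              struct rdma_cm_event *event)
{
    struct example_cm_work *w = kzalloc(sizeof(*w), GFP_ATOMIC);

    if (!w)
        return 0;   /* drop the event on allocation failure, for brevity */
    INIT_WORK(&w->work, example_cm_work_fn);
    w->cm_id = id;
    w->event = event->event;
    queue_work(system_wq, &w->work);
    return 0;
}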

LARGE DMA MEMORY ALLOCATION
Observation
- Allocation of large chunks of DMA-able memory during session establishment fails
- SVC reserves the majority of physical memory during system initialization for caching
Current mechanism
- IB Verbs use kmalloc() to allocate DMA-able memory for all the queues (see the sketch after this list)

  Type  Elements  Element Size (bytes)  Total Size
  SQ    2064      88                    ~177KB
  RQ    2064      32                    ~64KB
  CQ    2064      32                    ~64KB

  Single-connection memory requirement in the Linux OFED stack = ~297KB
Expectation
- IB Verbs must provide a means to allocate DMA-able memory from a pre-allocated memory pool, e.g. in ib_alloc_cq() and ib_create_qp()
Work Around / Solutions
- Modified the iWARP and RoCE drivers to use pre-allocated memory pools from SVC
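A minimal sketch of where the per-connection memory is allocated today, sized from the table above (EXAMPLE_QUEUE_DEPTH and the function name are hypothetical): ib_alloc_cq() and rdma_create_qp() allocate the queue memory internally, and neither takes a caller-supplied pool, which is the capability the expectation asks for.

#include <linux/err.h>
#include <rdma/ib_verbs.h>
#include <rdma/rdma_cm.h>

#define EXAMPLE_QUEUE_DEPTH 2064   /* SQ/RQ/CQ entries per connection */

static int example_alloc_conn_queues(struct rdma_cm_id *cm_id,
                                     struct ib_pd *pd, struct ib_cq **cq)
{
    struct ib_qp_init_attr attr = {};

    /* ~64KB of CQ entries, allocated inside ib_alloc_cq() */
    *cq = ib_alloc_cq(cm_id->device, NULL, EXAMPLE_QUEUE_DEPTH, 0,
                      IB_POLL_DIRECT);
    if (IS_ERR(*cq))
        return PTR_ERR(*cq);

    /* ~177KB of SQ plus ~64KB of RQ, allocated inside the provider driver;
     * no parameter exists for handing in pre-allocated DMA-able memory. */
    attr.send_cq = *cq;
    attr.recv_cq = *cq;
    attr.qp_type = IB_QPT_RC;
    attr.cap.max_send_wr = EXAMPLE_QUEUE_DEPTH;
    attr.cap.max_recv_wr = EXAMPLE_QUEUE_DEPTH;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    return rdma_create_qp(cm_id, pd, &attr);   /* QP is then cm_id->qp */
}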

QUERY DEVICE LIST
Observation
- No kernel verb to find the list of RDMA devices on the system until an RDMA session is established
- Per-device resource allocation is needed during kernel module initialization
Current mechanism
- An RDMA device becomes available only after a connection request is handled by the CM event handler
Expectation
- Need a verb equivalent to ibv_get_device_list() in kernel IB Verbs (see the sketch after this list)
Work Around
- The lack of such a verb complicates per-port resource allocation during initialization
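In the absence of a kernel-side ibv_get_device_list(), the closest existing mechanism appears to be registering an ib_client, for which the core invokes a callback on every RDMA device already present and on any added later. A minimal sketch (example_* names are hypothetical, and the callback signatures have changed slightly across kernel versions):

#include <linux/module.h>
#include <rdma/ib_verbs.h>

static void example_add_one(struct ib_device *device)
{
    /* Called once per RDMA device in the system, including devices that
     * appear later: per-device/per-port resources can be set up here. */
    pr_info("example: RDMA device %s found\n", device->name);
}

static void example_remove_one(struct ib_device *device, void *client_data)
{
    /* release the per-device resources allocated in example_add_one() */
}

static struct ib_client example_client = {
    .name   = "example_block_app",
    .add    = example_add_one,
    .remove = example_remove_one,
};

static int __init example_init(void)
{
    return ib_register_client(&example_client);
}

static void __exit example_exit(void)
{
    ib_unregister_client(&example_client);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");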

CONCLUSION
- Initial indications of IO performance compared to FC are excellent!
- iSER presents an opportunity for high-performance, flash-based Ethernet data centers
- Error recovery and handling is still evolving
- Mass adoption by storage vendors requires more work in OFED:
  - IB Verbs is not completely protocol independent
  - Proper documentation of RoCE vs. iWARP specific differences
  - Definitive resource allocation timeout values (equivalent of R_A_TOV in FC)
- The same requirements are applicable to NVMe-oF

3RD ANNUAL STORAGE DEVELOPER CONFERENCE 2017
THANK YOU
subhojit.roy@in.ibm.com, tprakash@in.ibm.com, loharora@in.ibm.com
[May 26th, 2017]