Performance Implications of Libiscsi RDMA Support

Performance Implications of Libiscsi RDMA Support
Roy Shterman, Software Engineer, Mellanox
Sagi Grimberg, Principal Architect, Lightbits Labs
Shlomo Greenberg, PhD, Electrical and Computer Engineering Department, Ben-Gurion University, Israel

Agenda
- Introduction to Libiscsi
- Introduction to iSER
- Libiscsi/iSER implementation
- The memory challenge in user-space RDMA
- Performance results
- Future work

What is Libiscsi?
- User-space iSCSI initiator implementation.
- High-performance, non-blocking async API.
- Mature.
- LGPL licensed.
- Portable, OS independent.
- Fully integrated in QEMU.
- Written and maintained by Ronnie Sahlberg [https://github.com/sahlberg/libiscsi]
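
For context, a minimal sketch of the kind of non-blocking event loop the library expects is shown below. This is not from the talk: the IQN and portal strings are placeholders, and exact signatures may differ slightly between libiscsi versions.

/* Minimal sketch of the libiscsi non-blocking API (illustrative only).
 * IQN and portal strings are made-up placeholders. */
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <iscsi/iscsi.h>

static int connected = 0;

/* Completion callback invoked by iscsi_service() when the login finishes. */
static void connect_cb(struct iscsi_context *iscsi, int status,
                       void *command_data, void *private_data)
{
        if (status != 0) {
                fprintf(stderr, "login failed: %s\n", iscsi_get_error(iscsi));
                exit(1);
        }
        connected = 1;
}

int main(void)
{
        struct iscsi_context *iscsi = iscsi_create_context("iqn.2017-01.org.example:initiator");
        if (iscsi == NULL)
                return 1;
        iscsi_set_targetname(iscsi, "iqn.2017-01.org.example:target0");
        iscsi_set_session_type(iscsi, ISCSI_SESSION_NORMAL);

        /* Asynchronous login to LUN 0; nothing blocks here. */
        if (iscsi_full_connect_async(iscsi, "127.0.0.1:3260", 0,
                                     connect_cb, NULL) != 0) {
                fprintf(stderr, "connect failed: %s\n", iscsi_get_error(iscsi));
                return 1;
        }

        /* Single-fd event loop: poll the events libiscsi asks for and
         * let iscsi_service() drive the state machine. */
        while (!connected) {
                struct pollfd pfd = {
                        .fd = iscsi_get_fd(iscsi),
                        .events = iscsi_which_events(iscsi),
                };
                if (poll(&pfd, 1, -1) < 0)
                        break;
                if (iscsi_service(iscsi, pfd.revents) != 0)
                        break;
        }

        iscsi_destroy_context(iscsi);
        return 0;
}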

Why Libiscsi?
- Originally developed to provide built-in iSCSI client-side support for KVM/QEMU.
- Lets a process access private Logical Units (LUNs) without needing root permissions.
- Has since grown iSCSI/SCSI compliance test suites.

iSCSI Extensions for RDMA (iSER)
- Specified by the IETF (RFC 7145).
- The transport layer (iSER or iSCSI/TCP) is transparent to the user.

iSER benefits
- Zero-copy
- CPU offload
- Fabric reliability
- High IOPS, low latency
- Inherits iSCSI management
- Fabric/hardware consolidation
- InfiniBand and/or Ethernet (RoCE/iWARP)

iSER read command flow (SCSI reads)
- The initiator sends a Protocol Data Unit (PDU) with the encapsulated SCSI read to the target.
- The target writes the data into the initiator's buffers with RDMA_WRITE operations.
- The target sends a response to the initiator, which completes the SCSI command.

iSER write command flow (SCSI writes)
- The initiator sends a Protocol Data Unit (PDU) with the encapsulated SCSI write to the target (it can also carry inline data to improve latency).
- The target reads the data from the initiator's buffers with RDMA_READ operations.
- The target sends a response to the initiator, which completes the SCSI command.
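
At the verbs level, the target-initiated data movement in these flows comes down to posting RDMA work requests. The fragment below is an illustrative libibverbs sketch, not the actual iSER target code: the qp, mr and remote buffer parameters are assumed to come from connection setup and PDU parsing, and the read flow would use IBV_WR_RDMA_WRITE instead of IBV_WR_RDMA_READ.

/* Illustrative libibverbs sketch (not iSER target code): how a target
 * pulls SCSI write data from the initiator's advertised buffer with an
 * RDMA_READ. qp, local_mr, remote_addr and rkey are assumed to have been
 * obtained during connection setup and PDU parsing. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_read_from_initiator(struct ibv_qp *qp, struct ibv_mr *local_mr,
                             void *local_buf, uint32_t len,
                             uint64_t remote_addr, uint32_t rkey)
{
        struct ibv_sge sge = {
                .addr   = (uintptr_t)local_buf,  /* where the data lands */
                .length = len,
                .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = (uintptr_t)local_buf;
        wr.opcode              = IBV_WR_RDMA_READ;   /* target-initiated read */
        wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = remote_addr;        /* from the SCSI write PDU */
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);
}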

Libiscsi iSER implementation
- Transparent integration.
- User-space networking (kernel bypass).
- High performance.
- Separation of data and control planes.
- Reduced latency through non-blocking fd polling.

Libiscsi stack modification
- Layered the stack.
- Centralized transport-specific code.
- Added a clean transport abstraction API.
- Plugged in iSER.

typedef struct iscsi_transport {
        int (*connect)(struct iscsi_context *iscsi, union socket_address *sa,
                       int ai_family);
        int (*queue_pdu)(struct iscsi_context *iscsi, struct iscsi_pdu *pdu);
        struct iscsi_pdu *(*new_pdu)(struct iscsi_context *iscsi, size_t size);
        int (*disconnect)(struct iscsi_context *iscsi);
        void (*free_pdu)(struct iscsi_context *iscsi, struct iscsi_pdu *pdu);
        int (*service)(struct iscsi_context *iscsi, int revents);
        int (*get_fd)(struct iscsi_context *iscsi);
        int (*which_events)(struct iscsi_context *iscsi);
} iscsi_transport;
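
An iSER backend then plugs in by filling this ops table with its own callbacks. The fragment below is illustrative only and assumes the struct definition above; the iser_* names are hypothetical placeholders, not the identifiers used in the actual libiscsi patches.

/* Illustrative only: how an iSER backend might populate the transport
 * abstraction above. The iser_* callback names are hypothetical
 * placeholders and would need real definitions elsewhere. */
static struct iscsi_transport iscsi_transport_iser = {
        .connect      = iser_connect,       /* RDMA-CM connection establishment */
        .queue_pdu    = iser_queue_pdu,     /* post the PDU as an RDMA send work request */
        .new_pdu      = iser_new_pdu,
        .disconnect   = iser_disconnect,
        .free_pdu     = iser_free_pdu,
        .service      = iser_service,       /* drain the RDMA completion queue */
        .get_fd       = iser_get_fd,        /* completion channel fd for poll() */
        .which_events = iser_which_events,
};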

QEMU iSER support
- The QEMU iSCSI block driver needed some modifications to support iSER:
  - Move the polling logic into the transport layer.
  - Pass I/O vectors down to the transport stack.
- Work in progress; should be available in the next few weeks.
- Also usable through libvirt!

Experiments and results
- Performance measured with Mellanox ConnectX-4 adapters on both initiator and target.
- Target side was TGT, a user-space iSCSI target, with RAM storage devices.
- I/O generator was FIO (Flexible I/O Tester).
- Each guest ran with a single CPU core and a single FIO process.
- Compared against iSCSI/TCP and block-device pass-through of kernel iSER devices.
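
For reference, a job file along the lines below matches the single-core, fixed-queue-depth setup described above; the device path and option values are assumptions for illustration, not the authors' exact configuration.

; Illustrative fio job, not the authors' exact configuration.
; /dev/sdb, the queue depth and the block size are placeholders that
; would be swept across the ranges shown in the charts below.
[global]
ioengine=libaio
direct=1
time_based
runtime=60
numjobs=1
cpus_allowed=0

[randread-qd32]
filename=/dev/sdb
rw=randread
bs=4k
iodepth=32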

[Chart: IOPS vs. I/O depth (0-128); series: iSER Libiscsi, TCP Libiscsi, iSER kernel PT]

[Chart: Bandwidth (KB/s) vs. block size (KB); series: iSER Libiscsi, TCP Libiscsi, iSER kernel PT]

[Chart: Latency (us) vs. I/O depth (0-128); series: iSER Libiscsi, TCP Libiscsi, iSER kernel PT]

[Chart: Latency (us) vs. block size (1k-128k); series: iSER Libiscsi, TCP Libiscsi, iSER kernel PT]

[Chart: Bandwidth (KB/s) across 1-4 VMs; series: iSER Libiscsi, TCP Libiscsi, iSER kernel PT]

[Chart: IOPS across 1-4 VMs; series: iSER Libiscsi, TCP Libiscsi, iSER kernel PT]

RDMA memory registration
- To allow remote access, the application must register the buffer with remote-access permissions.
- Registration is slow and not suitable for the data plane.
- Applications therefore usually preregister all buffers intended for networking and RDMA.
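
For illustration, this is the standard libibverbs registration the slide refers to; the protection domain is assumed to exist already, and this is a generic sketch rather than Libiscsi/iSER code.

/* Standard memory registration with libibverbs; this is the slow,
 * control-path operation described above. The protection domain 'pd'
 * is assumed to have been allocated already. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_io_buffer(struct ibv_pd *pd, size_t len)
{
        void *buf;

        if (posix_memalign(&buf, 4096, len))   /* page-aligned data buffer */
                return NULL;

        /* Pins the pages and creates lkey/rkey, allowing the remote peer
         * to RDMA read and write this buffer. */
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_REMOTE_READ);
}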

Memory registration in mid-layers
- Mid-layers often don't own the buffers; they receive them from the application.
- Examples: OpenMPI, SHMEM and Libiscsi/iSER.
- Registering memory for each data transfer is not acceptable.

Possible solutions
1) Pre-register the entire application address space.
2) Modify applications to use mid-layer buffers.
3) Pin-down cache: register and cache mappings on the fly.
4) Pageable RDMA (ODP): let the device and the kernel handle I/O page faults.

RDMA paging: On-Demand Paging (ODP)
- RDMA devices can support I/O page faults.
- An application can register a huge virtual memory region (even its entire address space).
- Hardware and kernel handle page faults and page invalidations.
- If locality is good enough, the performance penalty is amortized.
- Not bounded by physical memory.
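
A sketch of what an ODP registration looks like with libibverbs is shown below; the device context and protection domain are assumed to exist, and this is generic rdma-core usage rather than the Libiscsi implementation.

/* Sketch: registering a large region as ODP (pageable) memory, so pages
 * are not pinned up front and faults are resolved by the device and the
 * kernel. 'ctx' and 'pd' are assumed to exist; capabilities are checked
 * before registering. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <string.h>

struct ibv_mr *register_odp(struct ibv_context *ctx, struct ibv_pd *pd,
                            void *addr, size_t len)
{
        struct ibv_device_attr_ex attr;

        memset(&attr, 0, sizeof(attr));
        if (ibv_query_device_ex(ctx, NULL, &attr) ||
            !(attr.odp_caps.general_caps & IBV_ODP_SUPPORT)) {
                fprintf(stderr, "device does not support ODP\n");
                return NULL;
        }

        /* IBV_ACCESS_ON_DEMAND makes this a pageable registration:
         * no pinning at registration time, page faults handled later. */
        return ibv_reg_mr(pd, addr, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_ON_DEMAND);
}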

iSER with ODP and memory windows
- iSER can leverage ODP for a more efficient data path.
- But it cannot expose non-I/O memory for remote access.
- Solution: open a memory window on a pageable memory region (a fast operation that can be used in the data path).
- ODP support for memory windows is in the works.
- Initial experiments with ODP look promising.
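
To make the idea concrete, the sketch below shows a generic type-2 memory-window bind with libibverbs; qp, pd, mr and the buffer are assumed, and binding a window over an ODP region specifically depends on the in-progress support mentioned above. This is not the actual Libiscsi/iSER data path.

/* Sketch of binding a type-2 memory window over a registered (possibly
 * pageable) region. The bind is posted on the QP like any other work
 * request, which is why it is cheap enough for the data path. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int bind_window(struct ibv_qp *qp, struct ibv_pd *pd, struct ibv_mr *mr,
                void *buf, uint32_t len, uint32_t *rkey_out)
{
        struct ibv_mw *mw = ibv_alloc_mw(pd, IBV_MW_TYPE_2);
        struct ibv_send_wr wr, *bad_wr = NULL;

        if (!mw)
                return -1;

        memset(&wr, 0, sizeof(wr));
        wr.opcode                            = IBV_WR_BIND_MW;
        wr.send_flags                        = IBV_SEND_SIGNALED;
        wr.bind_mw.mw                        = mw;
        wr.bind_mw.rkey                      = ibv_inc_rkey(mw->rkey); /* fresh rkey for this bind */
        wr.bind_mw.bind_info.mr              = mr;                     /* underlying region */
        wr.bind_mw.bind_info.addr            = (uintptr_t)buf;
        wr.bind_mw.bind_info.length          = len;
        wr.bind_mw.bind_info.mw_access_flags = IBV_ACCESS_REMOTE_READ |
                                               IBV_ACCESS_REMOTE_WRITE;

        if (ibv_post_send(qp, &wr, &bad_wr))
                return -1;

        *rkey_out = wr.bind_mw.rkey;    /* advertised to the peer in the PDU */
        return 0;
}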

Future work
- Leverage RDMA paging support to reduce the memory footprint.
- Plenty of room for performance optimizations.
- Stability improvements.
- Libiscsi iSER unit tests.

Acknowledgments
This project was conducted under the supervision and guidance of Dr. Shlomo Greenberg, Ben-Gurion University. Special thanks to Ronnie Sahlberg, creator and maintainer of the Libiscsi library, for his support.