PARAVIRTUAL RDMA DEVICE

12th ANNUAL WORKSHOP 2016
PARAVIRTUAL RDMA DEVICE
Aditya Sarwade, Adit Ranadive, Jorgen Hansen, Bhavesh Davda, George Zhang, Shelley Gong
VMware, Inc.
[ April 5th, 2016 ]

MOTIVATION
[Diagram: traditional socket path (Sockets API -> TCP -> IPv4/IPv6 -> network device) vs. RDMA kernel-bypass path (RDMA Verbs API directly to the Host Channel Adapter)]
RDMA enables:
- OS bypass
- Zero-copy
- Low latency (<1 µs)
- High bandwidth
RDMA transports: InfiniBand, iWARP (Internet Wide Area RDMA Protocol), RoCE (RDMA over Converged Ethernet)
Why not PCI passthrough?
- No live migration support
- Transport dependent
- Needs an HCA
- Cannot share a non-SR-IOV HCA

INTRODUCTION
Paravirtual RDMA (PVRDMA) is a new PCIe virtual NIC:
- Supports the standard Verbs API
- Uses an HCA for performance, but works without one
- Multiple virtual devices can share an HCA without SR-IOV
- Supports vMotion (live migration)!
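Because PVRDMA exposes the standard Verbs API, an existing libibverbs application should run unmodified inside the guest. A minimal, hedged sketch of device discovery (not taken from the slides; only standard libibverbs calls are used):

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Minimal sketch: enumerate verbs devices inside the guest and open one.
 * An application written against libibverbs needs no PVRDMA-specific code;
 * the PVRDMA provider library (libvrdma) plugs in underneath. */
int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(list[i]));

    /* Open the first device; under PVRDMA this goes through the guest
     * kernel driver and the hypervisor's device emulation. */
    struct ibv_context *ctx = ibv_open_device(list[0]);
    if (!ctx) {
        fprintf(stderr, "failed to open device\n");
        ibv_free_device_list(list);
        return 1;
    }

    struct ibv_device_attr attr;
    if (!ibv_query_device(ctx, &attr))
        printf("max QPs: %d, max MRs: %d\n", attr.max_qp, attr.max_mr);

    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}
```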

ARCHITECTURE
Exposes a dual-function PCIe device to the guest: VMXNET3 + RDMA (RoCE)
- RDMA component reuses Ethernet properties from the paired NIC
Plugs into the OFED stack in the VM
- Provides verbs-level emulation
- Guest kernel driver (PVRDMA driver) and user-level library (libvrdma)
Operates over the ESX RDMA stack (vmkernel)
- GIDs generated by the guest kernel are registered with the HCA
[Diagram: guest RDMA app -> libvrdma/libibverbs -> PVRDMA driver -> PVRDMA device emulation -> PVRDMA NIC]

ARCHITECTURE (CONT.)
Virtualize some hardware resources (like QPs and MRs)
- Required for vMotion
- Create corresponding physical resources on the HCA
[Diagram: on Post, the emulation layer translates virtual MRs/QPs to physical ones and forwards SQ/RQ WQEs to the HCA QP; on Poll, CQEs from the physical CQ are translated back to the virtual CQ]
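A hedged sketch of the kind of virtual-to-physical mapping the emulation layer might maintain; the structures, field names, and lookup helper below are illustrative assumptions, not VMware's actual implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: one possible shape for the emulation layer's
 * virtual -> physical resource translation tables. */
struct vqp_map {
    uint32_t virt_qpn;   /* QP number exposed to the guest (stable across vMotion) */
    uint32_t phys_qpn;   /* QP number allocated on the underlying HCA */
};

struct vmr_map {
    uint32_t virt_lkey;  /* lkey/rkey seen by the guest */
    uint32_t phys_lkey;  /* key of the MR registered with the HCA */
};

/* On a guest "post", the emulation layer would look up the physical QP and
 * rewrite the WQE's keys before handing it to the HCA; on "poll", physical
 * QP numbers in CQEs are translated back to their virtual counterparts. */
static const struct vqp_map *lookup_qp(const struct vqp_map *tbl, size_t n,
                                       uint32_t virt_qpn)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].virt_qpn == virt_qpn)
            return &tbl[i];
    return NULL;
}
```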

ARCHITECTURE (CONT.)
Guest MR registered directly with the HCA
- Guest PAs converted to machine addresses
- Zero-copy
[Diagram: application buffer (guest VA, length) -> guest kernel builds a guest PA list -> device emulation converts it to a machine-address (host PA) list -> HCA is programmed with the guest VA -> MA mapping]
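From the guest application's point of view, memory registration is the ordinary ibv_reg_mr call; the page-list translation happens underneath. A minimal sketch assuming an already-created protection domain pd (the helper itself is illustrative):

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Standard guest-side registration. The buffer is pinned by the guest kernel,
 * and (per the slide) the guest-PA page list is translated by the device
 * emulation into machine addresses before the MR is created on the HCA,
 * so subsequent data transfers are zero-copy. */
struct ibv_mr *register_app_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr)
        free(buf);
    return mr;
}
```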

CONTROL AND DATA PATH
[Diagram: two guest OSes, each with an RDMA app and buffers over libvrdma/libibverbs, the PVRDMA driver, and PVRDMA device emulation, running on top of the ESXi RDMA stack and HCA device driver; control path and data path are shown between the two stacks, with the hosts connected via their HCAs over RoCE (RDMA over Converged Ethernet)]
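On the data path the guest still issues ordinary verbs calls; per the architecture slides, the emulation layer forwards them to the corresponding physical QP/CQ on the HCA. A hedged sketch of the standard post-send/poll-completion sequence, assuming the QP, CQ, and MR have already been set up:

```c
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

/* Standard verbs data path: post one SEND and busy-poll its completion.
 * Under PVRDMA, these calls target the virtual QP/CQ; the device emulation
 * translates them onto the corresponding physical QP/CQ on the HCA. */
int send_sync(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad = NULL;

    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;  /* spin until the signaled completion arrives */
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```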

RDMA TRANSPORT SELECTION
[Diagram: three ESX hosts connected by a vSphere Distributed Switch; memcpy within a host, RDMA between HCA-equipped hosts, TCP to a host with only a NIC]
PVRDMA transport selection:
- Memcpy: RDMA between peers on the same host
- TCP: RDMA between peers without HCAs (slow path)
- RDMA fast path: RDMA between peers with HCAs
PVRDMA vMotion: leverage transport selection to support vMotion of RDMA VMs
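A hedged sketch of the selection logic this slide describes; the enum, function, and its inputs are illustrative assumptions about how the emulation layer might choose a transport, not VMware's actual code:

```c
#include <stdbool.h>

/* Illustrative only: transport choice as described on the slide. */
enum pvrdma_transport {
    PVRDMA_TRANSPORT_MEMCPY,   /* peers on the same ESX host             */
    PVRDMA_TRANSPORT_TCP,      /* slow path: one or both hosts lack an HCA */
    PVRDMA_TRANSPORT_RDMA,     /* fast path: both hosts have an HCA      */
};

static enum pvrdma_transport
select_transport(bool same_host, bool local_hca, bool peer_hca)
{
    if (same_host)
        return PVRDMA_TRANSPORT_MEMCPY;
    if (local_hca && peer_hca)
        return PVRDMA_TRANSPORT_RDMA;
    return PVRDMA_TRANSPORT_TCP;
}
```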

vMOTION
Challenge:
- Lots of RDMA state lives within the hardware
- Physical resource IDs (like QPNs/MR keys) may change after migration
- Peers will not be aware of the new IDs
- Currently, no support to create resources with specified IDs

vMOTION
Current (partial) solution:
- Emulation layer can get virtual-to-physical translations from the peer
- Notify the peer about vMotion and pause QP/CQ processing
- After vMotion, resume QPs with the new translations
- Invisible to the guest
- Can only work when both endpoints are VMs
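A hedged sketch of the pause / re-translate / resume sequencing those bullets describe; the message names, structure, and helper below are assumptions made purely for illustration, not the actual PVRDMA host-to-host protocol:

```c
#include <stdint.h>

/* Illustrative only: sequencing between the emulation layers on the
 * migrating host and its peer, as described on the slide. */
enum vmotion_msg {
    VMOTION_PAUSE,       /* peer pauses QP/CQ processing for this connection */
    VMOTION_NEW_MAPPING, /* carries the virtual -> new physical translations */
    VMOTION_RESUME,      /* peer resumes using the updated translations      */
};

struct qp_translation {
    uint32_t virt_qpn;   /* unchanged, guest-visible QP number          */
    uint32_t phys_qpn;   /* new physical QP number after migration      */
    uint32_t virt_rkey;  /* unchanged, guest-visible MR key             */
    uint32_t phys_rkey;  /* new physical MR key after migration         */
};

/* Hypothetical control-channel send; a stub here, not a real API. */
static void peer_send(enum vmotion_msg msg, const void *payload, uint32_t len)
{
    (void)msg; (void)payload; (void)len;
}

static void migrate_connection(const struct qp_translation *xlate)
{
    peer_send(VMOTION_PAUSE, 0, 0);                         /* quiesce peer */
    peer_send(VMOTION_NEW_MAPPING, xlate, sizeof(*xlate));  /* update maps  */
    peer_send(VMOTION_RESUME, 0, 0);                        /* resume I/O   */
}
```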

vMOTION (FUTURE WORK)
Support vMotion when one of the endpoints is native (non-VM)
Need hardware support:
- Recreate specific QPNs and MR keys
- Ability to pause and resume QP state on the hardware
- Save/restore intermediate QP states
- Provide an isolated resource space to each PVRDMA device
  - Guarantee that specified resources can be recreated
  - Avoid collisions with existing resources
- Expose hardware resources directly to the guest
  - Lower virtualization overhead

PERFORMANCE
Testbed:
- 2 x Dell T320 hosts: E5-2440 @ 2.40 GHz, 24 GiB RAM, Mellanox ConnectX-3
- VMs: Ubuntu 12.04, kernel 3.5.0.45, x86_64, 2 vCPUs, 2 GiB RAM
- OFED send latency test: half RTT for 10K iterations
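The "half RTT for 10K iterations" methodology amounts to timing ping-pong exchanges and halving the per-round-trip time, as the OFED perftest send-latency test does. A minimal sketch of that calculation; do_pingpong() is a placeholder for the actual verbs send/receive exchange:

```c
#include <stdio.h>
#include <time.h>

#define ITERS 10000

/* Placeholder for one send + one receive with the peer. */
static void do_pingpong(void) { }

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        do_pingpong();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Each iteration is a full round trip, so one-way (half-RTT) latency
     * is total elapsed time divided by 2 * ITERS. */
    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 +
                  (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("half-RTT latency: %.2f usec\n", usec / (2.0 * ITERS));
    return 0;
}
```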

CURRENT LIMITATIONS
- Communication between VM and native endpoints not supported
  - Need a way to create resources with specified IDs
  - May need additional hardware support from vendors
  - Formalize vMotion support on hardware
- Currently only supports RoCEv1 in the guest
  - Can still operate over an underlying RoCEv2-only HCA
- No InfiniBand/iWARP support (future work)
- No remote READ/WRITE support on DMA MRs
- No SRQ/Atomics support yet
  - SRQs not currently supported on host ESX
- Only supports Linux guests currently
- No failover support for PVRDMA

12th ANNUAL WORKSHOP 2016
THANK YOU
Aditya Sarwade [asarwade@vmware.com]
VMware, Inc.