PARAVIRTUAL RDMA DEVICE
Agenda
- About us
- Why para-virtualize RDMA
- Project overview
- Open issues
- Future plans

About us
- Marcel, from the KVM team at Red Hat
- Yuval, from the Networking/RDMA team at Oracle
This is a shared-effort open source project to implement a paravirtual RDMA device. We implement VMware's pvrdma device in QEMU, since that saves us time: the guest drivers and userspace library already exist.

Why para-virtualize RDMA
- Make guests device-agnostic while still leveraging hardware capabilities
- SR-IOV becomes optional (by using multiple GIDs)
- Migration support
- Memory overcommit (without hardware support)
- Fast build-up of test clusters

Overview
[Architecture diagram: inside the VM, the PVRDMA guest driver talks to QEMU's PVRDMA device (Func1: RoCE) alongside a plain Ethernet function (Func0: ETH). The device is backed either by the kernel's soft-RoCE (rxe) ib device on top of an Ethernet TAP, or by a hardware RoCE device exposing a PF and multiple VFs with their netdevs.]

Expected performance
Comparing:
- Ethernet virtio NICs (state-of-the-art para-virtualization)
- PVRDMA RoCE with a soft-RoCE backend
- PVRDMA RoCE with a physical RDMA VF
Looking at throughput for small and large packets.

PCI registers
- BAR 0: MSI-X (Vec0: ring event, Vec1: async event, Vec3: command event)
- BAR 1: registers (VERSION, DSR, CTL, REQ, ERR, ICR, IMR, MAC)
- BAR 2: UAR (QP_NUM SEND/RECV, CQ_NUM ARM/POLL)

PCI registers: BAR 1 (VERSION, DSR, CTL, REQ, ERR, ICR, IMR, MAC)
- DSR: Device Shared Region (the command channel), holding the CMD slot address, the RESPONSE slot address, and the device CQ ring
- CTL: activate device, quiesce, reset
- REQ: execute the command at the CMD slot; the result is stored in the RESPONSE slot
- ERR: error details
- IMR: interrupt mask
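As a concrete illustration of this command channel, here is a minimal sketch of how a guest driver might issue one command through BAR 1. The register offsets and slot types are assumptions for illustration only; the real definitions live in the pvrdma device headers.

#include <stdint.h>
#include <string.h>

/* Assumed register offsets within BAR 1 (illustrative only). */
#define PVRDMA_REG_VERSION 0x00
#define PVRDMA_REG_DSR     0x04 /* Device Shared Region address */
#define PVRDMA_REG_CTL     0x08 /* activate / quiesce / reset */
#define PVRDMA_REG_REQ     0x0c /* execute the command at the CMD slot */
#define PVRDMA_REG_ERR     0x10 /* status of the last command */

static volatile uint32_t *bar1; /* mapped BAR 1 registers */
static void *cmd_slot;          /* CMD slot inside the DSR */
static void *resp_slot;         /* RESPONSE slot inside the DSR */

/* Issue one command: fill the CMD slot, kick REQ, read the status. */
static int exec_cmd(const void *req, size_t len)
{
    memcpy(cmd_slot, req, len);   /* copy the command into the DSR */
    bar1[PVRDMA_REG_REQ / 4] = 0; /* writing 0 to REQ runs the command */
    /* The device executes the command, fills the RESPONSE slot,
     * stores the status in ERR and posts the command interrupt. */
    return (int)bar1[PVRDMA_REG_ERR / 4];
}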

PCI registers: BAR 2 (UAR)
- QP operations: the guest driver writes the QP number and an op mask (SEND/RECV) to the QP offset (0); the write completes only after the operation has been sent to the backend device
- CQ operations: the guest driver writes the CQ number and an op mask (ARM/POLL) to the CQ offset (4); the write completes only after the command has been sent to the backend device
- Only one command at a time
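A matching sketch of the UAR doorbells. The offsets (0 for the QP doorbell, 4 for the CQ doorbell) and the recv bit (31) come from these slides; the remaining op-mask bit positions are assumptions for illustration.

#include <stdint.h>

#define UAR_QP_OFFSET 0        /* QP doorbell (slide: offset 0) */
#define UAR_CQ_OFFSET 4        /* CQ doorbell (slide: offset 4) */

#define UAR_QP_RECV (1u << 31) /* qp_recv_bit from the post-receive flow */
#define UAR_QP_SEND (1u << 30) /* assumed bit positions below */
#define UAR_CQ_ARM  (1u << 30)
#define UAR_CQ_POLL (1u << 31)

static volatile uint32_t *uar; /* mapped BAR 2 */

/* Each MMIO write returns only once the operation has been handed to
 * the backend device; only one command is processed at a time. */
static void ring_qp_send(uint32_t qpn) { uar[UAR_QP_OFFSET / 4] = qpn | UAR_QP_SEND; }
static void ring_qp_recv(uint32_t qpn) { uar[UAR_QP_OFFSET / 4] = qpn | UAR_QP_RECV; }
static void ring_cq_arm(uint32_t cqn)  { uar[UAR_CQ_OFFSET / 4] = cqn | UAR_CQ_ARM; }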

Resource manager
All resources are virtualized (1-1 mapping with the backend device): PDs, QPs, CQs, etc. The pvrdma device shares the resource memory allocated by the guest driver by means of a shared page directory.
[Diagram: in the VM, the driver's virtual QP (send/recv) and CQ rings are published through QP and CQ page directories; QEMU maps the same page directories into host memory and pairs each virtual QP/CQ with a hardware QP/CQ. The page directory info is passed over the CMD channel (DSR) in Create QP/CQ.]
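The page-directory idea can be pictured with a small sketch; the structure below is an illustrative stand-in, not the actual pvrdma layout.

#include <stdint.h>

#define PDIR_ENTRIES 512 /* e.g. one 4K page of 64-bit entries */

/* The guest driver fills in the guest-physical addresses of the pages
 * backing a QP/CQ ring; the device walks the same directory to map
 * that memory, so both sides share one ring with no copying. The
 * directory's own address is passed in the Create QP/CQ command. */
struct page_directory {
    uint64_t page_gpa[PDIR_ENTRIES];
    uint32_t npages;
};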

Flows: Create CQ
Guest driver:
- Allocates pages for the CQ ring
- Creates a page directory to hold the CQ ring's pages
- Initializes the CQ ring
- Initializes the 'Create CQ' command object (cqe, pdir etc.)
- Copies the command to the 'command' address (DSR)
- Writes 0 into the REQ register
Device:
- Reads the request object from the 'command' address
- Allocates a CQ object and initializes the CQ ring based on the pdir
- Creates the backend (hw) CQ
- Writes the operation status to the ERR register
- Posts a command-interrupt to the guest
Guest driver:
- Reads the HW response code from the ERR register
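Put together, the guest side of this flow might look like the sketch below, reusing exec_cmd() from the BAR 1 sketch. The opcode value, command layout and ring helpers (alloc_pdir, alloc_ring_pages, ring_pages_for, init_cq_ring, pdir_gpa) are hypothetical stand-ins for the driver's ring management code.

/* Hypothetical helpers standing in for the driver's ring management. */
extern struct page_directory *alloc_pdir(void);
extern void alloc_ring_pages(struct page_directory *pdir, int npages);
extern int ring_pages_for(uint32_t cqe);
extern void init_cq_ring(struct page_directory *pdir);
extern uint64_t pdir_gpa(struct page_directory *pdir);

#define CMD_CREATE_CQ 1 /* assumed opcode value */

struct create_cq_req {
    uint32_t op;       /* CMD_CREATE_CQ */
    uint32_t cqe;      /* number of CQ entries */
    uint64_t pdir_gpa; /* page directory holding the CQ ring pages */
};

static int guest_create_cq(uint32_t cqe)
{
    struct page_directory *pdir = alloc_pdir();
    alloc_ring_pages(pdir, ring_pages_for(cqe)); /* CQ ring storage */
    init_cq_ring(pdir);                          /* ring indices */

    struct create_cq_req req = {
        .op = CMD_CREATE_CQ, .cqe = cqe, .pdir_gpa = pdir_gpa(pdir),
    };
    /* exec_cmd() copies req to the CMD slot, writes 0 to REQ and
     * returns the HW response code read from ERR after the device
     * has posted its command interrupt; the created CQ handle would
     * be read from the RESPONSE slot. */
    return exec_cmd(&req, sizeof(req));
}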

Flows: Create QP
Guest driver:
- Allocates pages for the send and receive rings
- Creates a page directory to hold the rings' pages
- Initializes the 'Create QP' command object (max_send_wr, send_cq_handle, recv_cq_handle, pdir etc.)
- Copies the object to the 'command' address
- Writes 0 into the REQ register
Device:
- Reads the request object from the 'command' address
- Allocates a QP object and initializes the send and recv rings based on the pdir, along with the send and recv ring state
- Creates the backend QP
- Writes the operation status to the ERR register
- Posts a command-interrupt to the guest
Guest driver:
- Reads the HW response code from the ERR register
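On the device side, the interesting step is translating the command into a backend verbs QP. ibv_create_qp() is the real libibverbs entry point; the request layout, the QP type choice and the lookup helpers are illustrative assumptions.

#include <infiniband/verbs.h>

struct create_qp_req {              /* illustrative command layout */
    uint32_t max_send_wr, max_recv_wr;
    uint32_t send_cq_handle, recv_cq_handle;
    uint64_t pdir_gpa;
};

/* Hypothetical emulation helpers. */
extern void init_rings_from_pdir(uint64_t pdir_gpa);
extern struct ibv_cq *lookup_backend_cq(uint32_t handle);

static int dev_create_qp(struct ibv_pd *pd, const struct create_qp_req *req)
{
    /* map the guest pdir and set up the send/recv ring state */
    init_rings_from_pdir(req->pdir_gpa);

    struct ibv_qp_init_attr attr = {
        .send_cq = lookup_backend_cq(req->send_cq_handle),
        .recv_cq = lookup_backend_cq(req->recv_cq_handle),
        .cap     = { .max_send_wr  = req->max_send_wr,
                     .max_recv_wr  = req->max_recv_wr,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,      /* assumed QP type */
    };
    struct ibv_qp *backend_qp = ibv_create_qp(pd, &attr);
    /* the status goes to ERR and a command interrupt follows */
    return backend_qp ? 0 : -1;
}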

Flows: Post receive
Guest driver:
- Initializes a wqe and places it on the recv ring
- Writes the qpn with qp_recv_bit (31) set to the QP offset in the UAR
Device:
- Extracts the qpn from the UAR write
- Walks through the ring and does the following for each wqe:
  - Prepares a backend CQE context to be used when receiving the completion from the backend (wr_id, op_code, emu_cq_num)
  - For each sge, prepares a backend sge, mapping the dma address to a QEMU virtual address
  - Calls the backend's post_recv
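A sketch of the device side of this flow, using the real ibv_post_recv() call; the guest wqe layout, the ring-walking helpers, the address translation and the completion context are illustrative assumptions.

#include <infiniband/verbs.h>
#include <stdint.h>

#define MAX_SGE 4 /* assumed limit */

struct guest_sge { uint64_t addr; uint32_t length, lkey; };
struct guest_wqe { uint64_t wr_id; int num_sge; struct guest_sge sge[MAX_SGE]; };

/* Hypothetical emulation helpers. */
extern struct virt_qp *lookup_qp(uint32_t qpn);
extern struct guest_wqe *next_recv_wqe(struct virt_qp *qp);
extern void *gpa_to_hva(uint64_t gpa); /* guest dma addr -> qemu VA */
extern uint64_t alloc_comp_ctx(struct virt_qp *qp, struct guest_wqe *wqe);
extern struct ibv_qp *backend_qp_of(struct virt_qp *qp);

static void dev_post_recv(uint32_t qpn) /* qpn extracted from the UAR write */
{
    struct virt_qp *qp = lookup_qp(qpn);
    struct guest_wqe *wqe;

    while ((wqe = next_recv_wqe(qp)) != NULL) {
        struct ibv_sge sge[MAX_SGE];
        for (int i = 0; i < wqe->num_sge; i++) {
            /* map the guest dma address to a qemu virtual address */
            sge[i].addr   = (uintptr_t)gpa_to_hva(wqe->sge[i].addr);
            sge[i].length = wqe->sge[i].length;
            sge[i].lkey   = wqe->sge[i].lkey;
        }
        /* wr_id carries the context (wr_id, op_code, emu_cq_num)
         * recovered when the completion arrives */
        struct ibv_recv_wr wr = {
            .wr_id   = alloc_comp_ctx(qp, wqe),
            .sg_list = sge,
            .num_sge = wqe->num_sge,
        }, *bad;
        ibv_post_recv(backend_qp_of(qp), &wr, &bad);
    }
}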

Flows: Process completion
A dedicated thread is used to process backend events. At initialization it attaches to the device and creates a communication channel.
Thread main loop:
- Polls for a completion
- Unmaps the sges' virtual addresses
- Extracts emu_cq_num, wr_id and op_code from the context
- Writes the CQE to the CQ ring
- Writes the CQ number to the device CQ
- Sends a completion-interrupt to the guest
- Deallocates the context
- Acks the event (ibv_ack_cq_events)
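This loop maps naturally onto the standard libibverbs completion-channel API (ibv_get_cq_event / ibv_req_notify_cq / ibv_poll_cq / ibv_ack_cq_events, all real calls); the CQE write-back, interrupt and context helpers are illustrative stand-ins for the device emulation code.

#include <infiniband/verbs.h>
#include <stdint.h>

struct comp_ctx; /* holds wr_id, op_code, emu_cq_num and sge mappings */

/* Hypothetical emulation helpers. */
extern void unmap_sges(struct comp_ctx *ctx);
extern void write_guest_cqe(struct comp_ctx *ctx, const struct ibv_wc *wc);
extern void post_completion_interrupt(struct comp_ctx *ctx);
extern void free_comp_ctx(struct comp_ctx *ctx);

static void *comp_thread(void *arg)
{
    struct ibv_comp_channel *ch = arg; /* created at initialization */
    struct ibv_cq *cq;
    void *cq_ctx;

    while (ibv_get_cq_event(ch, &cq, &cq_ctx) == 0) {
        ibv_req_notify_cq(cq, 0);      /* re-arm for the next event */

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            struct comp_ctx *c = (struct comp_ctx *)(uintptr_t)wc.wr_id;
            unmap_sges(c);             /* undo the dma->hva mappings */
            write_guest_cqe(c, &wc);   /* CQE into the guest CQ ring,
                                          CQ number into the device CQ */
            post_completion_interrupt(c);
            free_comp_ctx(c);
        }
        ibv_ack_cq_events(cq, 1);      /* ack the event */
    }
    return NULL;
}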

Open issues
- RDMA CM support
- Guest memory registration
- Migration support
- Memory overcommit support
- Providing GIDs to guests

Open issues (cont): RDMA CM support
- Emulate QP1 at the QEMU level
- Modify RDMA CM to communicate with QEMU instances:
  - Pass incoming MADs to QEMU
  - Allow QEMU to forward MADs
- What would be the preferred mechanism?

Open issues (cont): Guest memory registration
- Guest userspace: the register-memory-region verb accepts only a single VA address, specifying both the on-wire key and a memory range.
- Guest kernels: in-kernel code may need to register all memory (ranges not used are not accessed; the kernel is trusted), and memory hotplug changes the present ranges.

Open issues (cont): Migration support
- Moving QP numbers: ensure the same QP numbers on the destination
- Memory tracking: migration requires detection of hw accesses to pages; maybe we can avoid the need for it
- Moving QP state: how should a QP behave during migration, perhaps some busy state?
- Network update/packet loss recovery: can the system recover from some packet loss?

Open issues (cont): Memory overcommit support
- Do we really need on-demand paging? Everything goes through QEMU, which is a user-level app on the host that forwards the requests to the host kernel.
- Did we miss anything?

Open issues (cont): Providing GIDs to guests (multi-GID)
- Instead of coming up with extra IPv6 addresses, it would be better to set the host GID based on the guest's IPv6 address.
- There is no API for doing that from user level.

Future plans
- Submit the VMware PVRDMA device implementation to QEMU (and later add RoCE v2 support)
- Improve performance
- Move to a virtio-based RDMA device. We would like to agree on a clean design:
  - A virtio queue for the command channel
  - A virtio queue for each recv/send/CQ buffer
  - Anything else?
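To make the proposal concrete, here is a purely speculative sketch of such a device's queue layout, with one virtqueue for the command channel and one per send/recv/CQ ring. Nothing here corresponds to an existing virtio specification; all names are illustrative.

struct virtqueue; /* as in the Linux virtio core */

/* Speculative virtio-rdma device layout matching the design above. */
struct virtio_rdma_device {
    struct virtqueue  *cmd_vq;   /* command channel (create/modify/destroy) */
    struct virtqueue **send_vqs; /* one per QP send ring */
    struct virtqueue **recv_vqs; /* one per QP recv ring */
    struct virtqueue **cq_vqs;   /* one per completion queue */
    unsigned int nqps, ncqs;
};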