RoCE Update. Liran Liss, Mellanox Technologies. March 2012


Agenda
- RoCE ecosystem
- QoS
- Virtualization
- High availability
- Latest news

RoCE in the Data Center
- Lossless configuration recommended
- Network configuration options:
  - Enable global pause:
    (config) # interface ethernet 1/x flowcontrol send on
    (config) # interface ethernet 1/x flowcontrol receive on
  - Enable PFC:
    (config) # dcb priority-flow-control priority 3-4 enable
  - Enable priority tagging to work without VLANs:
    (config) # interface ethernet 1/x switchport mode access-dcb

RoCE in the Data Center (cont.)
- Set up matching host configuration:
  - Enable PFC, either manually with dcbtool or lldptool-pfc, or automatically by DCBX (via lldpad)
  - Call rdma_set_option() with RDMA_OPTION_IP_TOS to determine the UP of RoCE connections; this sets UP = ip_tos[7:5] (precedence bits), as in the sketch below
- PFC (or global pause) is all it takes to get RoCE working well!
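A minimal sketch of the host-side call, assuming librdmacm is available; note that current librdmacm spells the option RDMA_OPTION_ID_TOS at level RDMA_OPTION_ID, and the ToS value used here (3 << 5, i.e. precedence 3) is just an illustrative choice that lands the connection on UP 3:

/* Sketch: steer a RoCE (RDMA_CM) connection to UP 3 by setting the IP ToS
 * precedence bits. Assumes librdmacm; error handling kept minimal. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <rdma/rdma_cma.h>

int main(void)
{
	struct rdma_event_channel *ch;
	struct rdma_cm_id *id;
	/* UP = ip_tos[7:5]; 3 << 5 selects UP 3, the PFC-enabled priority
	 * in the switch example above. */
	uint8_t tos = 3 << 5;

	ch = rdma_create_event_channel();
	if (!ch) {
		perror("rdma_create_event_channel");
		return EXIT_FAILURE;
	}
	if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP)) {
		perror("rdma_create_id");
		return EXIT_FAILURE;
	}
	/* Must be set before rdma_connect()/rdma_accept(). */
	if (rdma_set_option(id, RDMA_OPTION_ID, RDMA_OPTION_ID_TOS,
			    &tos, sizeof(tos))) {
		perror("rdma_set_option");
		return EXIT_FAILURE;
	}

	/* ... resolve address/route, create the QP, and connect as usual ... */

	rdma_destroy_id(id);
	rdma_destroy_event_channel(ch);
	return 0;
}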

40GE RoCE is Here
- Matches theoretical limit
- [Performance graphs not captured in the transcription]

Enhanced Transmission Selection (ETS)
- Provides BW guarantees to traffic classes (TCs) assigned for enhanced transmission selection
  - Denoted by a percentage of the BW remaining after transmitting from TCs subject to strict-priority or credit-based-shaper algorithms
- Designates User-Priority (UP) to TC mappings
- Host configuration: dcbtool/lldptool-ets
- [Diagram: UP0-UP7 mapped to TC0-TC7; TCs scheduled by DWRR or strict priority]

QoS Matters

Latency and total BW for a single TCP_RR flow, without and with 20 background TCP streams:

  #TCP streams  #TCP_RR   QoS OFF, RFS/Multi-Tx OFF       QoS OFF, RFS/Multi-Tx ON        QoS ON, RFS/Multi-Tx ON
                          Latency [us]  Total BW [Gbps]   Latency [us]  Total BW [Gbps]   Latency [us]  Total BW [Gbps]
  0             1         10.1          0                 10.7          0                 10.5          0
  20            1         10548         8934              37.0          9187              12            9330

TCP_RR latency [us] vs. number of RR flows:

                No TCP streams             With 20 TCP streams
                QoS OFF                    QoS OFF                    QoS ON (strict sched)  QoS ON (ETS 99:1)
  #RRs          RFS/MQ OFF   RFS/MQ ON     RFS/MQ OFF   RFS/MQ ON     RFS/MQ ON              RFS/MQ ON
  1             9.0          10.8          10134.2      37.9          12.0                   12.0
  2             11.9         10.7          14063.6      38.7          11.9                   12.2
  4             12.3         11.1          9137.2       50.8          12.5                   12.8
  6             12.9         11.2          12329.0      48.3          12.6                   12.6
  8             13.9         13.2          16261.5      41.2          15.0                   15.0
  10            15.2         14.5          12115.6      52.3          16.2                   16.3
  20            20.4         21.3          11455.8      48.6          23.4                   23.2

Granular QoS
- End-to-end network QoS may not be enough:
  - Some applications may require more than 8 (Ethernet) / 15 (InfiniBand) QoS levels
  - HW-level QoS under host admin control
  - Control over scheduling of application HW queues
- Solution: add another scheduling hierarchy level, i.e., QoS within a UP
- [Diagram: user QP and TCP flow sharing the Ethernet I/F, mapped to UP3/TC2]

The Complete Picture
- [Diagram: end-to-end QoS picture, split between Host Admin and Network Admin responsibilities]

Configuration (example)

# ethtool -f eth2
Fine-grain QoS for eth2:
Total number of QoS queues: 128
Current fine-grain QoS settings:
UP0 0:100
UP1 0:20 1:80
UP2 0:100
UP3 0:100
UP4 0:50 1:30 2:20
UP5 0:100
UP6 0:100
UP7 0:100

# ethtool -F eth2 up3 10, 40, 50 up1 100

# ethtool -f eth2
Total number of QoS queues: 128
Current fine-grain QoS settings:
UP0 0:100
UP1 0:100
UP2 0:100
UP3 0:10 1:40 2:50
UP4 0:50 1:30 2:20
UP5 0:100
UP6 0:100
UP7 0:100

Possible APIs
- Sockets: high bits of the SO_PRIORITY option (see the sketch below)
- RDMACM: add an equivalent RDMA_OPTION_PRIORITY
- Verbs: high bits of the SL field
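As a rough illustration of the sockets variant, here is a minimal sketch using the standard SO_PRIORITY socket option. The "high bits" encoding for fine-grain (intra-UP) QoS is only the proposal on this slide, not an existing kernel ABI, so the FINE_GRAIN_SHIFT used below is hypothetical:

/* Sketch: request a transmit priority for a TCP socket with SO_PRIORITY.
 * The low bits map to a User Priority today; the slide proposes carrying a
 * fine-grain (intra-UP) queue index in the high bits -- that encoding is
 * hypothetical and shown only for illustration. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define FINE_GRAIN_SHIFT 16   /* hypothetical: high bits = intra-UP queue index */

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	int up = 3;          /* User Priority (low bits, works with today's kernels) */
	int fine_queue = 1;  /* proposed intra-UP queue (high bits, hypothetical)    */
	int prio = (fine_queue << FINE_GRAIN_SHIFT) | up;

	/* Values above 6 (and thus any high-bit encoding) need CAP_NET_ADMIN. */
	if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio)) < 0)
		perror("setsockopt(SO_PRIORITY)");

	/* ... connect() and send; the qdisc/NIC maps prio to a UP and HW queue ... */
	return 0;
}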

RoCE SRIOV
- Each VF exposes a NIC + RoCE device
  - RoCE shares the NIC MAC address
  - GID table entries populated accordingly
- HW virtual switch settings apply to both:
  - MAC assignment / enforcement
  - VLAN enforcement
  - Default / allowable priorities
  - Rate limiting
- Same drivers for Hypervisor and Guest
- To be released in MLNX_OFED-2.0

RoCE SRIOV (cont.)
- [Diagram: identical driver stacks (tcp/ip, mlx4_en, scsi mid-layer, mlx4_fc, ib_core, mlx4_ib over mlx4_core) in the hypervisor and the Linux guest; the guest VF sends HW commands over a communication channel, while doorbells, interrupts, and DMA go directly to the ConnectX device through the IOMMU]

HA/LAG Solutions Today
- Linux bonding
  - Active-Backup, transmit/receive load-balancing, 802.3ad
  - Applies only to network interface traffic
- RDMACM bonding support
  - Active-Backup only
  - Binds to the active RDMA device upon connection establishment
- APM
  - Currently IB only, but could be enabled for RoCE
  - Applicable for RC and RD EECs
- In middleware
  - Not generic
- No single solution fits all!

Bonding as Unified Solution?
- Control-plane not fully exposed
- Data-plane does not go through bonding
- Physical ports not transparent to users
- [Diagram: application socket over TCP/IP and the bonding netdev, versus RoCE and Eth QPs going directly to the HW ib_dev and physical ports p1/p2]

Device LAG Support
- Manage the LAG/HA data-plane in the device
  - A single bonded port is exposed to the OS
  - Load-balancing and failover handled below the Verbs
  - Bonding modes: 802.3ad, Active-Backup (fail-over MAC disabled)
- Complete, coherent HA management for all protocols
  - Ethernet interface
  - RoCE kernel ULPs and user applications, including RC QPs (no need for APM!)
  - Raw Ethernet QPs
- Configure once, apply to all (see the verbs sketch below)
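To illustrate the "configure once" point: a plain libibverbs consumer simply sees the single bonded device and port and creates RC QPs as usual, with no APM or bonding awareness. A minimal sketch, assuming libibverbs and that the first enumerated device is the bonded HCA:

/* Sketch: an application over a device-LAG HCA opens the one exposed device
 * and queries its single bonded port; failover happens below the verbs, so
 * no APM or extra configuration is needed here. Assumes libibverbs. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	int num;
	struct ibv_device **list = ibv_get_device_list(&num);
	if (!list || num == 0) {
		fprintf(stderr, "no RDMA devices found\n");
		return 1;
	}

	struct ibv_context *ctx = ibv_open_device(list[0]);
	if (!ctx) {
		fprintf(stderr, "failed to open %s\n", ibv_get_device_name(list[0]));
		return 1;
	}

	struct ibv_port_attr attr;
	if (ibv_query_port(ctx, 1, &attr) == 0)
		printf("%s port 1 state: %d (bonded port, failover handled in HW)\n",
		       ibv_get_device_name(list[0]), attr.state);

	/* ... allocate PD/CQ and create RC QPs exactly as on any single-port HCA ... */

	ibv_close_device(ctx);
	ibv_free_device_list(list);
	return 0;
}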

Implementation
- Bind the control-plane to the bonding driver
  - Centralized place for LAG/HA configuration
  - Leverages the bonding modes and options of existing code
- RoCE/Raw-Ethernet QP internal configuration
  - On Rx, a QP may receive from both physical ports
  - On Tx, each QP is associated with a virtual port, which can map to any physical port
  - QPs are distributed between virtual ports for load balancing
  - Virtual ports are assigned to different physical ports if available
- The Ethernet driver tracks bonding decisions
  - Modifies the Virtual-to-Physical (V2P) port mappings accordingly (see the sketch below)
  - Offloaded traffic is sent to the mapped physical port
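The V2P remapping itself is driver-internal; as a purely illustrative sketch (all names and types below are hypothetical, not mlx4 code), the idea is a small table from virtual to physical ports that the Ethernet driver rewrites when bonding reports a link failure or a new active port, leaving the QPs themselves untouched:

/* Illustrative sketch of virtual-to-physical (V2P) port remapping on bonding
 * events. Everything here is hypothetical; a real driver would push this
 * table to the device with a firmware command. */
#include <stdio.h>

#define NUM_VPORTS 2
#define NUM_PPORTS 2

struct v2p_table {
	int map[NUM_VPORTS];   /* virtual port -> physical port (1-based) */
};

/* Normal operation: spread virtual ports across the physical ports. */
static void v2p_balance(struct v2p_table *t)
{
	for (int v = 0; v < NUM_VPORTS; v++)
		t->map[v] = (v % NUM_PPORTS) + 1;
}

/* Failover: point every virtual port (and thus every QP bound to it) at the
 * surviving physical port; the QPs are not modified. */
static void v2p_failover(struct v2p_table *t, int dead_pport)
{
	int alive = (dead_pport == 1) ? 2 : 1;
	for (int v = 0; v < NUM_VPORTS; v++)
		if (t->map[v] == dead_pport)
			t->map[v] = alive;
}

int main(void)
{
	struct v2p_table t;
	v2p_balance(&t);
	printf("balanced: vport1->p%d, vport2->p%d\n", t.map[0], t.map[1]);
	v2p_failover(&t, 2);   /* e.g. bonding reports port 2 link down */
	printf("failover: vport1->p%d, vport2->p%d\n", t.map[0], t.map[1]);
	return 0;
}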

Implementation (cont.)
- [Diagram: mlx4_en tracks bonding events on the netdevs and modifies the V2P mappings in HW; application QPs are created on virtual ports (vport1/vport2), which are not visible to consumers and are mapped to physical ports p1/p2]

Extensions to SRIOV
- SW virtual switch configurations often have more than a single uplink
  - Uplinks are teamed for LAG/HA
  - vnics have a single link; teaming is accomplished transparently to the vnic
- The same model can be used for SRIOV
  - Only a single network interface is passed to the VM
  - Single VF with one link
  - No bonding configuration required in the VM
  - The VF supports both vnic and vrdma

LAG Virtualization Model
- [Diagram: VMs with either SW vnics or VF-backed vnic+vrdma devices; in the hypervisor, the SW vswitch and RDMA traffic run over the device bonding, backed by the HW vswitch]

Implementation (cont.)
- [Diagram: SW vnics attach to the SW vswitch, which bonds the PF pnics (mlx4_core PF); VF-backed vnic/vrdma devices use mlx4_core in the VM and connect through per-port HW vswitches (HW vswitch 1/2) via the V2P mapping]

VF Port Assignment Options
- Distribute VFs between ports
  - Each VF is associated with a single port
  - On port failure, the VF migrates with its MAC
  - Bonding modes: Active-Backup, 802.3ad
- Distribute traffic between ports within a VF
  - The Ethernet driver hashes traffic according to L2/3/4 headers
  - Each RoCE/Raw-Ethernet QP is assigned to a single port (as in the non-virtualized case)
  - Bonding mode: 802.3ad

Latest News
- RoCE for Windows included in Mellanox WinOF 3.0
  - Submitted for Windows 8.0
- RoCE RDMA support in SMB 2.2 (Tom Talpey, OFA'12)
- VMware migration over RoCE (Bhavesh Davda and Josh Simons, OFA'12)
  - 36% improvement in vMotion time
  - >30% higher pre-copy BW
  - >90% reduction in CPU utilization

Thank You!