RoCE vs. iWARP Competitive Analysis

WHITE PAPER, February 2017

Contents: Executive Summary; RoCE's Advantages over iWARP; Performance and Benchmark Examples; Best Performance for Virtualization; Summary

Executive Summary

Remote Direct Memory Access (RDMA) provides direct access from the memory of one computer to the memory of another without involving either computer's operating system. This technology enables high-throughput, low-latency networking with low CPU utilization, which is especially useful in massively parallel compute clusters. RDMA over Converged Ethernet (RoCE) is the most commonly used RDMA technology for Ethernet networks and is deployed at scale in some of the largest hyperscale data centers in the world. RoCE is the only industry-standard Ethernet-based RDMA solution with a multi-vendor ecosystem delivering network adapters, and it operates over standard Layer 2 and Layer 3 Ethernet switches. The RoCE technology is standardized within industry organizations including the IBTA, IEEE, and IETF. Mellanox Technologies was the first company to implement the standard, and all of its product families from ConnectX-3 Pro onward implement a complete offload of the RoCE protocol. These solutions provide wire-speed throughput at up to 100Gb/s and market-leading latency, with the lowest possible CPU and memory utilization. As a result, the ConnectX family of adapters has been deployed in a variety of mission-critical, latency-sensitive data centers.

RoCE's Advantages over iWARP

iWARP is an alternative RDMA offering that is more complex and unable to achieve the same level of performance as RoCE-based solutions. iWARP uses a complex mix of layers, including DDP (Direct Data Placement), a tweak known as MPA (Marker PDU Aligned framing), and a separate RDMA protocol (RDMAP), to deliver RDMA services over TCP/IP. This convoluted architecture is an ill-conceived attempt to fit RDMA into existing software transport frameworks. Unfortunately, this compromise causes iWARP to fail to deliver precisely the three key benefits that RoCE achieves: high throughput, low latency, and low CPU utilization. In addition to the complexity and performance disadvantages, only a single vendor (Chelsio) supports iWARP on its current products, and the technology has not been widely adopted by the market. Intel previously supported iWARP in its 10GbE NIC from 2009, but has not supported it in any of its newer NICs since then. No iWARP support is available at the latest Ethernet speeds of 25, 50, and 100Gb/s.
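Whichever transport sits underneath, applications access RDMA through the same verbs interface (see Figure 1 below). As a minimal sketch, assuming libibverbs is available, the C fragment below shows the calls an application makes to register a buffer and post a one-sided RDMA WRITE. Queue pair connection setup and the out-of-band exchange of the peer's buffer address and rkey are elided, and peer_addr and peer_rkey are placeholders, not values from this paper.

```c
/* Minimal sketch (not a complete program): the verbs calls used to perform
 * a one-sided RDMA WRITE.  QP state transitions and the out-of-band exchange
 * of QP number, remote address, and rkey are elided; peer_* are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* e.g. a RoCE NIC */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register the local buffer so the NIC can DMA directly from it. */
    size_t len = 4096;
    char *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    struct ibv_qp_init_attr qpa = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,            /* reliable connected service */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qpa);

    /* ... QP would be moved through INIT/RTR/RTS and the peer's QP number,
     *     buffer address, and rkey exchanged out of band here ... */
    uint64_t peer_addr = 0;               /* placeholder: remote buffer VA */
    uint32_t peer_rkey = 0;               /* placeholder: remote MR rkey   */

    /* Post a one-sided RDMA WRITE: the remote CPU is not involved at all. */
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = {
        .opcode = IBV_WR_RDMA_WRITE,
        .sg_list = &sge, .num_sge = 1,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma = { .remote_addr = peer_addr, .rkey = peer_rkey },
    }, *bad;
    ibv_post_send(qp, &wr, &bad);

    /* Reap the completion by busy-polling the CQ, without kernel involvement. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    if (n > 0)
        printf("RDMA WRITE completed: %s\n", ibv_wc_status_str(wc.status));

    ibv_destroy_qp(qp); ibv_destroy_cq(cq); ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd); ibv_close_device(ctx); ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```

On a Linux host this would be compiled with gcc and linked with -libverbs. The same verbs code runs over RoCE, iWARP, or InfiniBand adapters, which is why the transport-level differences discussed below are what separates the two technologies.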

iWARP is designed to work over the existing TCP transport and is essentially an attempt to patch up existing LAN/WAN networks. The Ethernet data link delivers best-effort service, relying on the TCP layer to provide reliability. The need to support existing IP networks, including wide area networks, requires coverage of a larger set of boundary conditions with respect to congestion handling, scaling, and error handling, causing inefficiency in hardware offload of the RDMA and associated transport operations. RoCE, on the other hand, is a purpose-built RDMA transport protocol for Ethernet, not a patch used on top of existing TCP/IP protocols.

Figure 1. iWARP's complex network layers vs. RoCE's simpler model: both expose RDMA to applications through the OFA (OpenFabrics Alliance) verbs interface, but iWARP stacks RDMAP, DDP, and MPA over TCP/IP, while RoCE carries the IBTA transport protocol directly over UDP/IP on the Ethernet link layer.

Because TCP is connection-based, it must use reliable transport. iWARP therefore supports only the reliable connected transport service, which also means that it is not an appropriate platform for multicast. RoCE offers a variety of transport services, including reliable connected, unreliable datagram, and others, and enables user-level multicast capability.

iWARP traffic also cannot be easily managed and optimized in the fabric itself, leading to inefficient deployments. It does not provide a way to detect RDMA traffic at or below the transport layer, for example within the fabric itself. Because iWARP shares TCP's port space, flow management is impossible: the port alone cannot identify whether a message carries RDMA or traditional TCP. iWARP shares the protocol number space with legacy TCP traffic, so context (state) is required to determine that a packet is iWARP. Typically this context does not fit in the NIC's on-chip memory, which adds complexity and therefore time to traffic demultiplexing. The same problem occurs in the switches and routers of the fabric, where no such state is available. In contrast, a packet can be identified as RoCE simply by looking at its UDP destination port field: if the value matches the IANA-assigned port for RoCE, the packet is RoCE. This stateless traffic identification allows quick and early demultiplexing of traffic in a converged NIC implementation, and enables capabilities such as switch or fabric monitoring and access control lists (ACLs) for improved traffic flow analysis and management. Similarly, because iWARP shares port space with the legacy TCP stack, it also faces challenges integrating with OS stacks, while RoCE offers full OS stack integration. These challenges limit the cost-effectiveness and deployability of iWARP products, especially in comparison to RoCE.

RoCE includes IP and UDP headers in its packet encapsulation, so it can be used across both L2 and L3 networks. This enables Layer 3 routing, which brings RDMA to networks with multiple subnets. Resilient RoCE enables running RoCE on lossy fabrics that do not enable Flow Control or Priority Flow Control; RoCE's advanced hardware mechanisms deliver RDMA performance on lossy networks on par with that of lossless networks.
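The stateless identification described above is easy to picture in code. The following is a minimal sketch, assuming an untagged IPv4 Ethernet frame (VLAN tags and IPv6 are left out): a RoCEv2 packet is recognized purely from its UDP destination port, to which IANA has assigned the value 4791, whereas recognizing an iWARP packet would require per-connection TCP state.

```c
/* Sketch of stateless RoCEv2 classification: a NIC, switch, or monitoring
 * tool only needs the UDP destination port (IANA-assigned 4791); no per-flow
 * context is required, unlike iWARP, which shares TCP's port space.
 * Assumes an untagged IPv4 Ethernet frame. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ROCEV2_UDP_DPORT 4791u            /* IANA-assigned UDP port for RoCEv2 */

bool is_rocev2_packet(const uint8_t *frame, size_t len)
{
    if (len < 14 + 20 + 8)                /* Ethernet + minimal IPv4 + UDP    */
        return false;

    uint16_t ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    if (ethertype != 0x0800)              /* IPv4 only in this sketch         */
        return false;

    const uint8_t *ip = frame + 14;
    size_t ihl = (size_t)(ip[0] & 0x0F) * 4;   /* IPv4 header length in bytes */
    if (ip[9] != 17 || len < 14 + ihl + 8)     /* IP protocol 17 = UDP        */
        return false;

    const uint8_t *udp = ip + ihl;
    uint16_t dport = (uint16_t)((udp[2] << 8) | udp[3]);
    return dport == ROCEV2_UDP_DPORT;     /* stateless: one field, no context */
}
```

The same single-field match is what allows switches and monitoring tools to apply ACLs or flow analysis to RoCE traffic without tracking connections.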

Figure 2. Soft-RoCE architecture: the application uses the same user-space verbs driver and kernel RDMA stack whether the transport is implemented by a Soft-RoCE kernel driver running over a standard Ethernet NIC or fully offloaded in a hardware RoCE HCA, while the sockets/TCP/IP path runs alongside unchanged.

Finally, by deploying Soft-RoCE (Figure 2), an implementation of RoCE in software, RoCE can be extended to devices that do not natively support RoCE in hardware. This enables greater flexibility in leveraging RoCE's benefits in the data center.

Performance and Benchmark Examples

EDC latency-sensitive applications such as Hadoop for real-time data analysis are cornerstones of competitiveness for Web 2.0 and Big Data providers. Such platforms can benefit from Mellanox's ConnectX-3 Pro, as its RoCE solution delivers extremely low latencies on Ethernet while scaling to handle millions of messages per second. Benchmarks comparing Chelsio's T5 and T6 adapters running messaging applications over 25, 40, and 100Gb Ethernet iWARP against the ConnectX-3 Pro with RoCE show that RoCE consistently delivers messages significantly faster than iWARP (Figure 3).

Figure 3. 25, 40, and 100Gb Ethernet latency benchmarks (40GbE RDMA latency shown): latency (μs) vs. message size, Chelsio iWARP vs. ConnectX-3 Pro RoCE.
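When reproducing comparisons like the one in Figure 3, a first step is confirming which verbs devices are present on the host and whether they run over Ethernet (hardware RoCE, or Soft-RoCE on a plain NIC) or InfiniBand. The sketch below is a hedged example using libibverbs; device names such as mlx5_0 or rxe0 vary from system to system and are not taken from this paper.

```c
/* Hedged sketch: enumerate verbs devices and print the link layer of port 1.
 * A Soft-RoCE (rxe) device bound to a plain Ethernet NIC appears in this
 * list just like a hardware RoCE adapter. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) { perror("ibv_get_device_list"); return 1; }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        struct ibv_port_attr pa;
        if (ibv_query_port(ctx, 1, &pa) == 0)      /* port numbers start at 1 */
            printf("%-16s link layer: %s\n",
                   ibv_get_device_name(devs[i]),
                   pa.link_layer == IBV_LINK_LAYER_ETHERNET
                       ? "Ethernet (RoCE or Soft-RoCE)" : "InfiniBand");

        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```

On recent kernels a Soft-RoCE device can be attached to an Ethernet interface with the iproute2 rdma tool (for example, rdma link add rxe0 type rxe netdev eth0), after which it appears in this list alongside any hardware RoCE adapters.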

When measuring ConnectX-3's RoCE latency against Intel's NetEffect iWARP adapter, the results are even more impressive. At 10Gb, RoCE showed an 86% latency improvement at the smaller message size, and an additional improvement at the larger one (Figure 4).

Figure 4. Mellanox RoCE and Intel iWARP latency benchmark

Message Size (Bytes) | NetEffect iWARP | ConnectX-3 10GbE RoCE | ConnectX-3 40GbE RoCE
(smaller)            | 7.22μs          | 1μs                   | 0.78μs
(larger)             | 1.58μs          | 3.79μs                | 2.13μs

Meanwhile, RoCE throughput at 40Gb on the ConnectX-3 Pro is more than 2X higher than iWARP throughput on the Chelsio T5 (Figure 5), and 5X higher than iWARP throughput at 10Gb with Intel (Figure 6).

Figure 5. 40Gb Ethernet throughput benchmark: RDMA Write bandwidth (Gb/s) vs. message size, Chelsio iWARP vs. ConnectX-3 Pro RoCE.

Figure 6. 10Gb Ethernet throughput benchmark: RDMA Write bandwidth (Gb/s) vs. message size, Intel iWARP vs. ConnectX-3 RoCE.

The performance advantages are maintained whether RoCE runs over a lossless or a lossy network (Figure 7). With Resilient RoCE, Mellanox provides consistent, top performance with congestion control in lossy environments.

Figure 7. Host send and receive throughput (5KB packets) in a lossy network with RoCE and iWARP

Best Performance for Virtualization

A further advantage of RoCE is its ability to run over SR-IOV, delivering RoCE's superior performance of lowest latency, lowest CPU utilization, and maximum throughput in a virtualized environment. RoCE has proven it can provide less than 1μs latency between virtual machines while maintaining consistent throughput as the virtual environment scales. Chelsio's iWARP does not run across multiple VMs in SR-IOV, relying instead on TCP for VM-to-VM communication. The difference in latency is astounding (Figure 8).

Figure 8. 40Gb Ethernet VM-to-VM latency, RoCE vs. TCP: latency (μs) vs. message size for iWARP 40GbE (TCP path) and ConnectX-3 Pro RoCE.
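As a hedged illustration of the host-side setup this relies on, the sketch below enables a few SR-IOV virtual functions on a physical NIC through the standard Linux sysfs interface. The interface name eth0 and the VF count of 4 are placeholders, and the PF driver and platform must support SR-IOV; each VF can then be passed through to a virtual machine, which runs RoCE on it directly.

```c
/* Hedged sketch: enable SR-IOV virtual functions via the standard Linux
 * sysfs knob so each VM can be handed its own VF and run RoCE natively.
 * "eth0" and the VF count are placeholders; the PF driver must support
 * SR-IOV and the IOMMU must be enabled on the host. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/class/net/eth0/device/sriov_numvfs";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "%d\n", 4);    /* request 4 virtual functions from the PF */
    fclose(f);
    return 0;
}
```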

Summary

RoCE simplifies the transport protocol: it bypasses the TCP stack to enable true and scalable RDMA operations, resulting in higher ROI. RoCE is a standard protocol built specifically with data center traffic in mind, with careful consideration paid to latency, performance, and CPU utilization, and it performs especially well in virtualized environments. When a network runs over Ethernet, RoCE provides a superior solution compared to iWARP. For the enterprise data center seeking the ultimate in performance, RoCE is clearly the choice, especially when latency-sensitive applications are involved. Moreover, RoCE is currently deployed in dozens of data centers with up to hundreds of thousands of nodes, while iWARP is virtually non-existent in the field. Simply put, RoCE is the obvious way to deploy RDMA over Ethernet.

350 Oakmead Parkway, Suite 100, Sunnyvale, CA 94085 Tel: 408-970-3400 Fax: 408-970-3403 www.mellanox.com

Copyright 2017. Mellanox Technologies. All rights reserved. Mellanox, Mellanox logo, and ConnectX are registered trademarks of Mellanox Technologies, Ltd. All other trademarks are property of their respective owners. 15-4514WP Rev 2.