SoftRDMA: Rekindling High Performance Software RDMA over Commodity Ethernet

Mao Miao, Fengyuan Ren, Xiaohui Luo, Jing Xie, Qingkai Meng, Wenxue Cheng
Dept. of Computer Science and Technology, Tsinghua University

Background: Remote Direct Memory Access (RDMA)
- Protocol offload: reduces CPU overhead and bypasses the kernel
- Memory pre-allocation and pre-registration
- Zero-copy: data is transferred directly from the user-space domain

RDMA network protocols
- InfiniBand (IB)
- RDMA over Converged Ethernet (RoCE)
- Internet Wide Area RDMA Protocol (iWARP)
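To make the pre-allocation/pre-registration point concrete, below is a minimal sketch (not from the paper) of how an application allocates a buffer once and registers it with libibverbs so the NIC can later DMA into or out of it without kernel involvement; error handling is trimmed and the buffer size is arbitrary.

/* Minimal sketch of RDMA memory pre-registration with libibverbs.
 * Illustrates the "pre-allocate and pre-register" step; build with
 * gcc pre_reg.c -libverbs. Error handling is trimmed for brevity. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Pre-allocate a buffer once, then register it so the NIC can
     * DMA directly into/out of it (zero-copy), bypassing the kernel. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}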

Background: InfiniBand (IB)
- Custom network protocol and purpose-built hardware
- Lossless L2: hop-by-hop, credit-based flow control prevents packet drops

Challenges
- Incompatible with Ethernet infrastructure
- High cost: DC operators need to deploy and manage two separate networks

Background: RDMA over Converged Ethernet (RoCE)
- Essentially IB over Ethernet
- Routability: RoCEv2 adds UDP and IP layers to provide it
- Lossless L2: Priority-based Flow Control (PFC)

Challenges
- Complex and restrictive configuration
- Perils of using PFC in large-scale deployments: head-of-line blocking, unfairness, congestion spreading, etc.

Background: Internet Wide Area RDMA Protocol (iWARP)
- Enables RDMA over the existing TCP/IP stack; leverages TCP/IP's reliability and congestion control mechanisms for scalability, routability and reliability
- Only the NIC (RNIC) has to be specially built; no other changes are required

Challenges
- Requires a specially-built RNIC

Motivation: Common challenges for RDMA deployment
- Special-purpose, non-backward-compatible devices: Ethernet incompatibility, inflexibility of hardware devices
- High cost: equipment replacement, extra burden of operation and management

Is it possible to design software RDMA (SoftRDMA) over commodity Ethernet devices?

Motivation: Software framework evolution
- High-performance packet I/O: Intel DPDK, netmap, PacketShader I/O (PSIO) (sketched below)
- High-performance user-level stacks: mTCP, IX, Arrakis, Sandstorm
- Driving technical changes: memory pre-allocation and reuse, zero-copy, kernel bypassing, batch processing, affinity and prefetching
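As a reference for what high-performance packet I/O with batching and kernel bypass looks like in practice, here is a minimal DPDK-style receive loop. This is a sketch under assumed defaults (port 0, one queue, burst size 32), not SoftRDMA's code; most error checks and the TX path are omitted.

/* Minimal sketch of batched, kernel-bypass packet RX with DPDK. */
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <stdio.h>

#define BURST_SIZE 32

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0) {
        fprintf(stderr, "EAL init failed\n");
        return 1;
    }

    /* Pre-allocate packet buffers from hugepage memory once. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "rx_pool", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    uint16_t port = 0;                      /* assumed port id */
    struct rte_eth_conf conf = {0};
    rte_eth_dev_configure(port, 1, 1, &conf);
    rte_eth_rx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL, pool);
    rte_eth_tx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL);
    rte_eth_dev_start(port);

    struct rte_mbuf *bufs[BURST_SIZE];
    for (;;) {
        /* Poll the NIC RX ring; packets arrive in batches, amortizing
         * per-packet cost (no interrupts, no syscalls, no kernel copy). */
        uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb; i++)
            rte_pktmbuf_free(bufs[i]);      /* hand to user-level stack here */
    }
    return 0;
}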

Motivation: RDMA and these new software techniques share much of the same design philosophy
- Memory pre-allocation
- Zero-copy
- Kernel bypassing

Can we design SoftRDMA on top of high-performance packet I/O?
- Comparable performance to hardware RDMA schemes
- No customized devices required
- Compatible with Ethernet infrastructure

SoftRDMA Design: Dedicated Userspace iWARP Stack
[Figure: three stack organizations, each running Applications and the Verbs API over RDMAP/DDP/MPA, TCP/IP and the NIC driver (with DPDK in the last case)]
- User-level iWARP + kernel-level TCP/IP
- Kernel-level iWARP + kernel-level TCP/IP
- User-level iWARP + user-level TCP/IP on DPDK (the SoftRDMA approach)
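For orientation on the lowest iWARP layer in the figure, the sketch below frames a DDP segment as an MPA FPDU (16-bit ULPDU length, padding to a 4-byte boundary, CRC-32c) before it is written onto the TCP byte stream, roughly following RFC 5044. MPA markers and error handling are omitted, and this is illustrative rather than SoftRDMA's implementation.

/* Rough sketch of MPA framing (RFC 5044): an FPDU is
 * [16-bit ULPDU length][ULPDU][pad to 4B][CRC-32c].
 * The caller must size 'out' to at least ulpdu_len + 9 bytes. */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Simple bitwise CRC-32c (Castagnoli), reflected form. */
static uint32_t crc32c(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
    }
    return crc ^ 0xFFFFFFFF;
}

/* Returns the total FPDU length written into 'out'. */
size_t mpa_frame(const uint8_t *ddp_seg, uint16_t ulpdu_len, uint8_t *out)
{
    size_t off = 0;
    uint16_t len_be = htons(ulpdu_len);

    memcpy(out + off, &len_be, 2);          off += 2;        /* MPA length */
    memcpy(out + off, ddp_seg, ulpdu_len);  off += ulpdu_len;

    while (off % 4 != 0)
        out[off++] = 0;                                      /* pad to 4B  */

    uint32_t crc = htonl(crc32c(out, off));                  /* CRC-32c    */
    memcpy(out + off, &crc, 4);             off += 4;
    return off;
}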

SoftRDMA Design: Dedicated Userspace iWARP Stack
In-kernel stack
- Mode-switching overhead
- Complexity of kernel modification

User-level stack
- Eliminates mode-switching overhead
- More freedom in stack design

One-Copy versus Zero-Copy
Conventional kernel path: seven steps for packets from NIC to application

Two-Copy
- Step 6: packets are copied from the RX ring buffer for stack processing
- Step 7: data is copied into each application's buffer after processing

[Figure: kernel receive path, from the NIC DMA engine and hardware interrupt through the softirq/poll handling of the RX ring buffer, TCP/IP stack processing, and the two copies up to the applications' read() calls]

One-Copy versus Zero-Copy
One-Copy
- Memory mapping between kernel and user space removes the copy in Step 7

Zero-Copy
- Sharing the DMA region removes the copy in Step 6

[Figure: the same receive path, highlighting which of the two copies each approach eliminates]

One-Copy versus Zero-Copy
Two obstacles for Zero-Copy in stack processing
- Where to put incoming packets for different applications, and how to manage them? Before stack processing, the stack does not know the application-appointed place to store incoming packets.
- Is the DMA region large enough, or can it be reused fast enough, to hold incoming packets? The DMA region is finite and fixed, and can only store up to a few thousand packets.

SoftRDMA therefore adopts One-Copy (sketched below).
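A rough illustration of the one-copy receive path under these constraints: the NIC DMAs the frame into a DPDK mbuf, and after user-level protocol processing the payload is copied exactly once into a buffer the application allocated in advance. The struct and function names here are hypothetical, not SoftRDMA's actual code.

/* Illustrative one-copy delivery of a received payload. */
#include <rte_mbuf.h>
#include <string.h>

struct app_recv_buf {
    uint8_t *base;     /* pre-allocated, application-owned memory */
    size_t   len;      /* capacity */
    size_t   used;     /* bytes delivered so far */
};

/* Copy the payload of one mbuf (headers already parsed by the
 * user-level stack) into the application's buffer: the single copy. */
static int deliver_payload(struct app_recv_buf *dst, struct rte_mbuf *m,
                           uint16_t hdr_len)
{
    uint16_t pay_len = rte_pktmbuf_data_len(m) - hdr_len;
    if (dst->used + pay_len > dst->len)
        return -1;                               /* no room: caller decides */

    const uint8_t *pay = rte_pktmbuf_mtod_offset(m, const uint8_t *, hdr_len);
    memcpy(dst->base + dst->used, pay, pay_len); /* the one copy */
    dst->used += pay_len;

    rte_pktmbuf_free(m);                         /* mbuf returns to the pool */
    return 0;
}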

SoftRDMA Threading Model
[Figure (c1): two threads connected by event conditions and batched syscalls, one running the application and one running TCP/IP]

Traditional multi-threading model (c1)
- One thread for application processing, the other for packet RX/TX
- Good for throughput thanks to batch processing
- Higher latency due to thread switching and communication cost

SoftRDMA Threading Model
[Figure (c2): each thread runs the application and TCP/IP over the user-space NIC driver]

Run-to-completion threading model (c2)
- Each thread runs all stages (packet RX/TX, application processing) to completion
- Does improve latency
- Lengthy application processing may cause packet loss

SoftRDMA Threading Model
[Figure (c3): thread 1 handles packet RX, thread 2 handles the application and TX over TCP/IP]

SoftRDMA threading model (c3)
- One thread for packet RX, including the one copy; the other for application processing and packet TX
- Accelerates packet reception
- Application processing and packet TX run within one thread, improving efficiency and reducing latency (see the sketch below)
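A sketch of how such a two-thread split might be wired up with DPDK primitives: thread 1 polls the NIC and enqueues packets onto a lock-free ring, thread 2 dequeues them for application processing and TX. Ring size, burst size and the process_and_send() hook are placeholders for illustration, not SoftRDMA's actual code.

/* Two-thread pipeline (c3) sketch using an rte_ring as handoff queue. */
#include <rte_ring.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <pthread.h>

#define BURST 32

static struct rte_ring *handoff;

static void *rx_thread(void *arg)                 /* thread 1: packet RX */
{
    uint16_t port = *(uint16_t *)arg;
    struct rte_mbuf *bufs[BURST];
    for (;;) {
        uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST);
        for (uint16_t i = 0; i < nb; i++)
            if (rte_ring_enqueue(handoff, bufs[i]) < 0)
                rte_pktmbuf_free(bufs[i]);        /* queue full: drop */
    }
    return NULL;
}

static void process_and_send(struct rte_mbuf *m); /* app + TX (assumed hook) */

static void *app_tx_thread(void *arg)             /* thread 2: app + packet TX */
{
    (void)arg;
    void *m;
    for (;;)
        if (rte_ring_dequeue(handoff, &m) == 0)
            process_and_send((struct rte_mbuf *)m);
    return NULL;
}

/* Setup (EAL/port init omitted):
 *   handoff = rte_ring_create("handoff", 4096, rte_socket_id(),
 *                             RING_F_SP_ENQ | RING_F_SC_DEQ);
 *   pthread_create(&t1, NULL, rx_thread, &port);
 *   pthread_create(&t2, NULL, app_tx_thread, NULL);
 */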

SoftRDMA Implementation: 20K lines of code, 7.8K of them new
- DPDK I/O
- User-level TCP/IP based on the lwIP raw API (sketched below)
- MPA/DDP/RDMAP layers of iWARP
- RDMA Verbs
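To show the flavor of the lwIP raw (callback) API this stack builds on, here is a minimal TCP listener whose receive callback hands the byte stream to the iWARP layers. iwarp_rx_bytes() is a hypothetical hook standing in for MPA/DDP/RDMAP processing, and lwIP/DPDK initialization is assumed to happen elsewhere.

/* Sketch of a TCP receive path on the lwIP raw (callback) API. */
#include "lwip/tcp.h"
#include "lwip/pbuf.h"

void iwarp_rx_bytes(const void *data, u16_t len);   /* hypothetical hook */

static err_t on_recv(void *arg, struct tcp_pcb *pcb, struct pbuf *p, err_t err)
{
    if (p == NULL) {                 /* remote side closed the connection */
        tcp_close(pcb);
        return ERR_OK;
    }
    /* Walk the pbuf chain and feed the byte stream to the iWARP layers. */
    for (struct pbuf *q = p; q != NULL; q = q->next)
        iwarp_rx_bytes(q->payload, q->len);

    tcp_recved(pcb, p->tot_len);     /* open the TCP receive window */
    pbuf_free(p);
    return ERR_OK;
}

static err_t on_accept(void *arg, struct tcp_pcb *newpcb, err_t err)
{
    tcp_recv(newpcb, on_recv);       /* register the RX callback */
    return ERR_OK;
}

void iwarp_listen(u16_t port)
{
    struct tcp_pcb *pcb = tcp_new();
    tcp_bind(pcb, IP_ADDR_ANY, port);
    pcb = tcp_listen(pcb);
    tcp_accept(pcb, on_accept);
}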

SoftRDMA Performance
Experiment configuration
- Dell PowerEdge R430
- Intel 82599ES 10GbE NIC
- Chelsio T520-SO-CR 10GbE RNIC

Four RDMA implementation schemes
- Hardware-supported RNIC (iWARP RNIC)
- User-level iWARP on kernel sockets (Kernel Socket)
- User-level iWARP on the DPDK-based lwIP sequential API (Sequential API)
- User-level iWARP on the DPDK-based lwIP raw API (SoftRDMA)

SoftRDMA Performance
Short message (up to 10KB) transfer
- Latency is close:
  SoftRDMA: 6.63us/64B, 6.80us/1KB, 52.20us/10KB
  RNIC: 3.59us/64B, 5.29us/1KB, 16.27us/10KB
- Throughput falls far behind, but is acceptable for short-message delivery

SoftRDMA Performance
Long message (10KB-500KB) transfer
- Latency is close:
  SoftRDMA: 101.36us/100KB, 500.06us/500KB
  RNIC: 93.45us/100KB, 432.50us/500KB
- Throughput becomes close as messages grow:
  SoftRDMA: 1461.71Mbps/10KB, 7893.31Mbps/100KB
  RNIC: 8854.16Mbps/10KB, 8917.44Mbps/100KB
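A quick sanity check (arithmetic added here, not from the slides) relating the reported latency and throughput for 10KB transfers:

\[ \text{throughput} \approx \frac{\text{message size}}{\text{latency}} = \frac{10 \times 1024 \times 8\ \text{bit}}{52.20\ \mu\text{s}} \approx 1.57\ \text{Gbps} \]

which is roughly consistent with the reported 1461.71 Mbps for 10KB on SoftRDMA: short-message throughput is bound by per-message latency rather than by the 10GbE link.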

Future work
- A more stable and robust user-level stack
- Exploiting NIC hardware features to accelerate protocol processing: TSO/LSO/LRO, memory-based scatter/gather for zero-copy
- More comparison experiments: tests among SoftRDMA, iWARP NICs and RoCE NICs; tests on 40GbE/50GbE devices

Conclusion
SoftRDMA: a high-performance software RDMA implementation over commodity Ethernet
- A dedicated userspace iWARP stack built on high-performance network I/O
- One-Copy data delivery
- A carefully designed threading model

Thanks! Q&A