RDMA over Commodity Ethernet at Scale


RDMA over Commodity Ethernet at Scale
Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitendra Padhye, Marina Lipshteyn
ACM SIGCOMM 2016, August 24, 2016

Outline
- RDMA/RoCEv2 background
- DSCP-based PFC
- Safety challenges: RDMA transport livelock, PFC deadlock, PFC pause frame storm, slow-receiver symptom
- Experiences and lessons learned
- Related work
- Conclusion

RDMA/RoCEv2 background
- RDMA (Remote Direct Memory Access) addresses TCP's latency and CPU overhead problems
- RDMA offloads the transport layer to the NIC
- RDMA needs a lossless network
- RoCEv2: RDMA over commodity Ethernet
  - DCQCN for connection-level congestion control
  - PFC for hop-by-hop flow control
[Figure: host stack comparison. TCP/IP runs in the kernel above the NIC driver; RDMA applications use RDMA verbs in user space, with the RDMA transport, IP, and Ethernet layers offloaded to the NIC, which DMAs data directly to and from memory across a lossless network.]
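The transport offload is exposed to applications through RDMA verbs. As a rough illustration only, here is a minimal sketch of posting a send with libibverbs, assuming a connected QP, a registered MR, and a CQ already exist; this is not code from the talk and error handling is trimmed.

```c
/* Minimal sketch (assumptions: a connected QP, a registered MR, and a CQ
 * already exist; error handling trimmed). Not code from the talk. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int post_one_send(struct ibv_qp *qp, struct ibv_cq *cq,
                  struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,        /* buffer inside the registered MR */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,       /* two-sided RDMA Send */
        .send_flags = IBV_SEND_SIGNALED, /* ask for a completion entry */
    };
    struct ibv_send_wr *bad_wr = NULL;
    struct ibv_wc wc;

    if (ibv_post_send(qp, &wr, &bad_wr))  /* hand the WQE to the NIC */
        return -1;
    while (ibv_poll_cq(cq, 1, &wc) == 0)  /* transport runs in hardware;  */
        ;                                 /* the CPU only polls for the CQE */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```

Everything after the post (segmentation, retransmission, DMA into the remote buffer) happens on the NIC, which is why NIC behavior and a lossless fabric matter so much in the rest of the talk.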

Priority-based flow control (PFC)
- Hop-by-hop flow control, with eight priorities for HOL blocking mitigation
- The priority of a data packet is carried in the VLAN tag
- When an ingress queue exceeds the XOFF threshold, the switch sends a PFC pause frame to inform the upstream device to stop sending that priority
[Figure: data packets of priorities p0-p7 arrive at an ingress port; crossing the XOFF threshold triggers a PFC pause frame back to the upstream egress port.]
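A rough sketch of the per-priority pause logic. The thresholds, cell accounting, and structure below are illustrative assumptions, not a vendor implementation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_PRIORITIES 8
#define XOFF_CELLS 1000   /* hypothetical thresholds, in buffer cells */
#define XON_CELLS   800

struct pfc_state {
    uint32_t occupancy[NUM_PRIORITIES]; /* ingress buffer used per priority */
    bool     paused[NUM_PRIORITIES];    /* have we paused the upstream? */
};

/* Stub: a real switch would emit an 802.1Qbb pause frame on the ingress
 * port's wire; here we just log it. Nonzero quanta = stop, 0 = resume. */
static void send_pfc_frame(int prio, uint16_t quanta)
{
    printf("PFC: priority %d %s\n", prio, quanta ? "XOFF" : "XON");
}

void on_enqueue(struct pfc_state *s, int prio, uint32_t cells)
{
    s->occupancy[prio] += cells;
    if (!s->paused[prio] && s->occupancy[prio] >= XOFF_CELLS) {
        send_pfc_frame(prio, 0xFFFF);   /* pause only this priority upstream */
        s->paused[prio] = true;
    }
}

void on_dequeue(struct pfc_state *s, int prio, uint32_t cells)
{
    s->occupancy[prio] -= cells;
    if (s->paused[prio] && s->occupancy[prio] <= XON_CELLS) {
        send_pfc_frame(prio, 0);        /* resume this priority */
        s->paused[prio] = false;
    }
}
```

Because pausing is per priority, congestion in one lossless class does not stop traffic in the other seven classes, which is the HOL-blocking mitigation the slide refers to.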

DSCP-based PFC
- Issues with VLAN-based PFC:
  - It breaks PXE boot (the NIC-to-ToR link must run in trunk mode, but packets carry no VLAN tag during PXE boot)
  - There is no standard way to carry the VLAN tag in L3 networks
- DSCP-based PFC:
  - The DSCP field carries the priority value
  - No change is needed to the PFC pause frame
  - Supported by major switch and NIC vendors
[Figure: ToR switch and NIC exchanging data packets and PFC pause frames; the link is in trunk mode, but there is no VLAN tag during PXE boot.]
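The idea is simply that the PFC priority is derived from the IP header rather than from an 802.1Q tag. A small illustrative sketch (the mapping table and the DSCP value 26 are made-up examples, not the deployed configuration):

```c
/* Illustrative sketch (assumption, not a vendor API): at switch ingress,
 * the PFC priority is taken from the packet's DSCP field instead of the
 * VLAN PCP, so lossless classes work for untagged L3 traffic. */
#include <stdint.h>

static uint8_t dscp_to_priority[64];   /* configured by the operator */

/* DSCP is the upper 6 bits of the IPv4 TOS byte. */
uint8_t classify_pfc_priority(uint8_t ipv4_tos)
{
    uint8_t dscp = ipv4_tos >> 2;
    return dscp_to_priority[dscp];
}

int main(void)
{
    /* Example: DSCP 26 carries RDMA traffic in lossless priority 3. */
    dscp_to_priority[26] = 3;
    return classify_pfc_priority(26 << 2) == 3 ? 0 : 1;
}
```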

Outline (recap): RDMA/RoCEv2 background; DSCP-based PFC; safety challenges (RDMA transport livelock, PFC deadlock, PFC pause frame storm, slow-receiver symptom); experiences and lessons learned; related work; conclusion

RDMA transport livelock
- Experiment: a sender and receiver connected through a switch that drops packets at a rate of 1/256
- With go-back-0 retransmission, every NAK rewinds the sender to RDMA Send 0, so the same packets are resent again and again and the message never completes (livelock)
- With go-back-N retransmission, a NAK for packet N rewinds the sender only to RDMA Send N, so the transfer keeps making progress despite the drops
[Figure: packet traces contrasting go-back-0 and go-back-N under a 1/256 drop rate.]
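A toy simulation of this effect, purely to contrast the two retransmission policies. The 8192-packet message size, the packet budget, and the simplification that only the lost packet is resent on a NAK are assumptions; this is not the NIC's actual firmware behavior.

```c
/* Toy simulation: with a 1/256 drop rate, go-back-0 rewinds to packet 0 on
 * every loss and makes no progress within the packet budget, while
 * go-back-N rewinds only to the lost packet. */
#include <stdio.h>
#include <stdlib.h>

#define MSG_PKTS    8192
#define DROP_ONE_IN 256
#define BUDGET      100000000L

static int dropped(void) { return rand() % DROP_ONE_IN == 0; }

static long run(int go_back_zero)
{
    long sent = 0;
    int next = 0;                           /* next in-order packet to deliver */
    while (next < MSG_PKTS && sent < BUDGET) {
        sent++;
        if (dropped())
            next = go_back_zero ? 0 : next; /* rewind to 0 vs. resend packet N */
        else
            next++;
    }
    return next < MSG_PKTS ? -1 : sent;     /* -1: gave up within the budget */
}

int main(void)
{
    srand(1);
    long g0 = run(1), gn = run(0);
    if (g0 < 0)
        printf("go-back-0: no completion within %ld packets (livelock)\n", BUDGET);
    else
        printf("go-back-0: finished after %ld packets\n", g0);
    printf("go-back-N: finished after %ld packets\n", gn);
    return 0;
}
```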

PFC deadlock
- Our data centers use a Clos network: packets first travel up, then go down
- Up-down routing has no cyclic buffer dependency, so in theory there should be no deadlock
- But we did experience deadlock!
[Figure: Clos topology with podsets, pods, spine and leaf layers, ToRs, and servers.]

PFC deadlock: preliminaries
- ARP table: IP address to MAC address mapping
- MAC table: MAC address to port mapping
- If the MAC entry is missing, packets are flooded to all ports
[Figure: example tables. ARP table: IP0 -> MAC0 (TTL 2h), IP1 -> MAC1 (TTL 1h). MAC table: MAC0 -> Port0 (TTL 10min), MAC1 -> no port. A packet destined to IP1 resolves to MAC1, but the MAC table has no port for MAC1, so the packet is flooded.]
- Note the TTL mismatch: ARP entries live for hours while MAC entries live for minutes, so a valid ARP entry can coexist with a missing MAC entry

PFC deadlock: an example
[Figure: ToRs T0 and T1, leaves La and Lb, and servers S1-S5. Flows follow the paths {S1, T0, La, T1, S3}, {S1, T0, La, T1, S5}, and {S4, T1, Lb, T0, S2}. Packets destined to a dead server are dropped/flooded, congested ingress and egress ports on T0, T1, La, and Lb exchange PFC pause frames, and the pauses form a cycle: deadlock.]

PFC deadlock: root cause and solution
- Root cause: the interaction between PFC flow control and Ethernet packet flooding
- Solution: drop the lossless packets if the ARP entry is incomplete
- Recommendation: do not flood or multicast lossless traffic
- Call for action: more research on deadlocks
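The stated rule can be written as a small forwarding-decision sketch (illustrative C, not switch code; the struct and names are assumptions):

```c
/* Sketch of the rule: never flood packets of a lossless priority. If the
 * MAC/ARP lookup for a lossless packet fails, drop it instead of flooding,
 * which removes the flooding + PFC interaction behind the cyclic buffer
 * dependency. */
#include <stdbool.h>

enum action { FORWARD, FLOOD, DROP };

struct pkt {
    bool lossless_class;  /* e.g. mapped from DSCP to a PFC-enabled priority */
};

/* egress_port is the MAC-table lookup result, or -1 if the entry is missing. */
enum action forwarding_decision(const struct pkt *p, int egress_port)
{
    if (egress_port >= 0)
        return FORWARD;
    /* Unknown destination: lossy traffic may be flooded as usual, but
     * flooded lossless copies can be paused by PFC and complete a cyclic
     * buffer dependency, so they are dropped instead. */
    return p->lossless_class ? DROP : FLOOD;
}
```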

NIC PFC pause frame storm
- A malfunctioning NIC may block the whole network
- PFC pause frame storms caused several incidents
- Solution: watchdogs on both the NIC and the switch side to stop the storm
[Figure: two podsets under a spine layer, with leaf switches, ToRs 0-7, and servers; a single malfunctioning NIC in podset 0 is the source of the pause frame storm.]
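A hedged sketch of what a switch-side pause-storm watchdog can look like. The threshold, field names, and quarantine policy are assumptions for illustration, not the production implementation described in the paper.

```c
/* If an egress queue of a lossless priority has been continuously paused by
 * its downstream device for too long, stop honoring PFC (drop instead), so a
 * malfunctioning NIC cannot pause the whole network. */
#include <stdbool.h>
#include <stdint.h>

#define PAUSE_STORM_MS 100   /* hypothetical threshold */

struct queue_watch {
    uint64_t paused_since_ms;  /* 0 when not currently paused */
    bool     quarantined;      /* PFC ignored / packets dropped */
};

void watchdog_tick(struct queue_watch *q, bool paused_now, uint64_t now_ms)
{
    if (!paused_now) {
        q->paused_since_ms = 0;
        q->quarantined = false;        /* storm over: restore lossless service */
        return;
    }
    if (q->paused_since_ms == 0)
        q->paused_since_ms = now_ms;
    else if (now_ms - q->paused_since_ms > PAUSE_STORM_MS)
        q->quarantined = true;         /* stop honoring pause; drop instead */
}
```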

The slow-receiver symptom
- The ToR-to-NIC link is 40 Gb/s, while the NIC-to-server PCIe Gen3 x8 link is 64 Gb/s
- Yet NICs may still generate a large number of PFC pause frames
- Root cause: the NIC is resource constrained (its on-chip caches for the MTT, QPC, and WQEs are limited, and misses require PCIe reads from host DRAM)
- Mitigations:
  - Use a large page size for MTT (memory translation table) entries
  - Dynamic buffer sharing at the ToR
[Figure: server with CPU and DRAM holding WQEs and QPC, connected over PCIe Gen3 x8 (64 Gb/s) to the NIC with its MTT; the NIC connects to the ToR over a 40 Gb/s QSFP link and sends pause frames upstream.]
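A back-of-the-envelope sketch of why a larger MTT page size helps. The 64 GiB of registered memory and the 8-byte entry size are assumptions for illustration, not numbers from the talk.

```c
/* Larger MTT page size -> far fewer translation entries for the same amount
 * of registered memory -> more of the MTT fits in the NIC's cache and fewer
 * PCIe round trips stall the receive path. */
#include <stdio.h>

int main(void)
{
    const unsigned long long registered  = 64ULL << 30;  /* 64 GiB registered */
    const unsigned long long entry_bytes = 8;            /* assumed entry size */
    const unsigned long long page_sizes[] = { 4ULL << 10, 2ULL << 20 }; /* 4 KiB, 2 MiB */

    for (int i = 0; i < 2; i++) {
        unsigned long long entries = registered / page_sizes[i];
        printf("page size %4llu KiB -> %8llu MTT entries (~%llu KiB of table)\n",
               page_sizes[i] >> 10, entries, entries * entry_bytes >> 10);
    }
    return 0;
}
```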

Outline (recap): RDMA/RoCEv2 background; DSCP-based PFC; safety challenges (RDMA transport livelock, PFC deadlock, PFC pause frame storm, slow-receiver symptom); experiences and lessons learned; related work; conclusion

Latency reduction
- RoCEv2 has been deployed in Bing world-wide for one and a half years
- Significant latency reduction
- The incast problem is solved, since there are no packet drops

RDMA throughput
- Measured between two podsets, each with 500+ servers
- 5 Tb/s of capacity between the two podsets
- Achieved 3 Tb/s of inter-podset throughput, bottlenecked by ECMP routing
- Close to 0 CPU overhead

Latency and throughput tradeoff
- RDMA latencies increase once data shuffling starts: low latency vs. high throughput
[Figure: latency in microseconds, before vs. during data shuffling, for servers S0,0-S0,23 and S1,0-S1,23 under ToRs T0 and T1 beneath leaves L0 and L1.]

Lessons learned
- Deadlock, livelock, and PFC pause frame propagation and storms did happen: be prepared for the unexpected
- Configuration management and monitoring of latency/availability, PFC pause frames, and RDMA traffic are essential
- NICs are the key to making RoCEv2 work
- Loss vs. lossless: is lossless really needed?

Related work
- InfiniBand
- iWARP
- Deadlock in lossless networks
- TCP performance tuning vs. RDMA

Conclusion
- RoCEv2 has been running safely in Microsoft data centers for one and a half years
- DSCP-based PFC scales RoCEv2 from L2 to L3
- The various safety issues and bugs (livelock, deadlock, PFC pause storm, PFC pause propagation) can all be addressed
- Future work: RDMA for inter-DC communications, understanding deadlocks in data centers, lossless low-latency high-throughput networking, application adoption