Revisiting Network Support for RDMA


Revisiting Network Support for RDMA Radhika Mittal 1, Alex Shpiner 3, Aurojit Panda 1, Eitan Zahavi 3, Arvind Krishnamurthy 2, Sylvia Ratnasamy 1, Scott Shenker 1 (1: UC Berkeley, 2: Univ. of Washington, 3: Mellanox Inc.)

Rise of RDMA in datacenters. In the traditional networking stack, application data is copied through the OS before reaching the NIC; with RDMA, a specialized NIC moves application data directly, bypassing the OS copy. RDMA enables low CPU utilization, low latency, and high throughput.

Current Status. RoCE (RDMA over Converged Ethernet) is the canonical approach for deploying RDMA in datacenters. It needs a lossless network to get good performance, and the network is made lossless using Priority Flow Control (PFC), which complicates network management and leads to congestion spreading and deadlocks.

Is losslessness really needed? No! Simple changes to NIC design enable similar and often better performance without losslessness.

Background

Evolution of RDMA. Traditionally used in InfiniBand clusters, where losses are rare (credit-based flow control). NICs were therefore not designed to deal with losses efficiently: the receiver discards out-of-order packets, and the sender does go-back-N on detecting a packet loss (via timeouts/negative ACKs).

Go-Back-N Loss Recovery. Retransmit all packets sent after the last acknowledged packet. For example, if packets 1-5 are sent and packet 2 is lost, packets 2-5 are all retransmitted.
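
As a rough illustration of this recovery rule (not the NIC's actual logic), a minimal Python sketch; the packet numbering and function name are made up for the example:

    # Go-back-N sender sketch (illustrative only, not the RoCE NIC implementation).
    def go_back_n_retransmit(last_acked, next_to_send):
        """On detecting a loss (timeout or negative ack), retransmit every
        packet sent after the last cumulatively acknowledged one."""
        return list(range(last_acked + 1, next_to_send))

    # Packets 1-5 were sent and only packet 1 was acknowledged, so a single
    # loss (packet 2) forces packets 2, 3, 4 and 5 to all be resent.
    print(go_back_n_retransmit(last_acked=1, next_to_send=6))  # [2, 3, 4, 5]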

RDMA over Converged Ethernet. RoCE: RDMA over an Ethernet fabric. RoCEv2: RDMA over IP-routed networks. The InfiniBand transport was adopted as-is, including go-back-N loss recovery, so it needs losslessness for good performance. The network was made lossless using PFC.

Why not iWARP? Designed to support RDMA over generic (non-lossless) networks, but it implements the entire TCP stack in hardware along with multiple other layers. It is not as popular as RoCE: less mature, more power, more expensive. RoCE + PFC emerged as the popular choice.

Priority Flow Control (PFC). An XOFF pause frame is sent when queuing exceeds a certain threshold, pausing the upstream transmission; an XON frame resumes it once the queue drains.
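
A highly simplified sketch of the per-queue pause decision, assuming hypothetical XOFF/XON thresholds (real switches configure these per priority class and in buffer cells rather than packet counts):

    # Simplified PFC pause/resume decision for one ingress queue (illustrative).
    XOFF_THRESHOLD = 100   # hypothetical queue depth (packets) at which to pause the upstream hop
    XON_THRESHOLD = 50     # hypothetical depth at which to resume it

    def pfc_action(queue_depth, upstream_paused):
        """Return 'XOFF', 'XON', or None depending on queue occupancy."""
        if not upstream_paused and queue_depth >= XOFF_THRESHOLD:
            return "XOFF"   # ask the upstream hop to stop sending on this priority
        if upstream_paused and queue_depth <= XON_THRESHOLD:
            return "XON"    # queue has drained enough; let it resume
        return None

    print(pfc_action(queue_depth=120, upstream_paused=False))  # XOFF
    print(pfc_action(queue_depth=40, upstream_paused=True))    # XON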

Drawbacks of PFC. Adds complexity to network management: each switch needs enough buffer headroom to absorb all packets still in flight after an XOFF is sent.
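
The headroom requirement can be seen with a back-of-the-envelope calculation. The numbers below assume the 10 Gbps / 2 us links used later in the experimental setup and a 1500-byte MTU, and ignore pause-frame processing delays, so this is illustrative rather than a vendor buffer-sizing formula:

    # Back-of-the-envelope PFC headroom estimate (illustrative assumptions only).
    link_rate_bps = 10e9      # 10 Gbps link
    one_way_delay_s = 2e-6    # 2 us propagation delay
    mtu_bytes = 1500          # assumed MTU

    # After XOFF is sent, data keeps arriving for roughly one round trip
    # (the XOFF travels upstream while packets keep travelling downstream),
    # plus up to one maximum-size packet in flight at each end of the link.
    headroom_bytes = link_rate_bps / 8 * (2 * one_way_delay_s) + 2 * mtu_bytes
    print(f"headroom per port and priority >= {headroom_bytes:.0f} bytes")  # ~8000 bytes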

Drawbacks of PFC. Performance issues: unfairness and head-of-line (HoL) blocking, since a pause stops all traffic on a priority, including flows not responsible for the congestion.

Drawbacks of PFC. Congestion spreading: pauses propagate upstream from the congested switch, throttling traffic elsewhere in the network.

Drawbacks of PFC. Deadlocks caused by cyclic buffer dependencies among switches.

Congestion Control with RoCE. Advanced congestion control for RoCE: DCQCN (Zhu et al., SIGCOMM 2015, Microsoft), ECN-based congestion control implemented in NIC hardware (Mellanox ConnectX-4); and Timely (Mittal et al., SIGCOMM 2015, Google), delay-based congestion control.

Recent works highlighting PFC issues: RDMA over Commodity Ethernet at Scale (SIGCOMM 2016); Deadlocks in Datacenter Networks: Why Do They Form, and How to Avoid Them (HotNets 2016); Unlocking Credit Loop Deadlocks (HotNets 2016).

Our Approach. Based on the principle of iWARP (the NIC efficiently deals with packet losses), but on the implementation of RoCE: incorporate only the necessary bare-minimum features.

Is losslessness needed for RDMA?

Experimental Setup. Mellanox simulator modeling Mellanox ConnectX-4 NICs, built on OMNeT++/INET. Three-layered fat-tree topology. Links with 10 Gbps capacity and 2 us delay. Heavy-tailed workload distribution at 70% utilization.

Metrics: average flow completion time (FCT), 99th-percentile FCT, and average slowdown.

Current RoCE NICs: PFC helps a lot.

Key results PFC is needed with current RoCE NICs. PFC is not needed with IRN. IRN performs better than current RoCE NICs.

Improved RoCE NIC (IRN). Three key changes: (1) selective retransmit instead of go-back-N; (2) back off on losses; (3) cap the number of outstanding bytes by the BDP.

IRN, no advanced congestion control: disabling PFC gives much better performance.

IRN with advanced congestion control: PFC is not needed.

Key results PFC is needed with current RoCE NICs. PFC is not needed with IRN. IRN performs better than current RoCE NICs.

IRN vs RoCE: IRN performs better than RoCE.

Key results PFC is needed with current RoCE NICs. PFC is not needed with IRN. IRN performs better than current RoCE NICs.

Robustness of results. Tested a wide range of experimental scenarios:
- Higher link bandwidths of 40 Gbps and 100 Gbps
- More uniform workload distribution
- Varying link delay
- Varying link utilization
Our key take-aways hold across all of these scenarios.

Efficient RDMA transport does not require a lossless network. With simple NIC changes, a generic out-of-the-box Ethernet network outperforms a lossless network.

Deep Dive

Improved RoCE NIC (IRN). Three key changes, examined in turn: (1) selective retransmit instead of go-back-N; (2) back off on losses; (3) cap the number of outstanding bytes by the BDP.

Need for Selective Retransmit. With selective retransmit, only the lost packet (e.g., packet 2 of packets 1-5) is resent, instead of every packet after it as with go-back-N. Disabling selective retransmit for IRN+DCQCN increased avg FCT by 28% for our default scenario.
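
A minimal sketch of the contrast, assuming the receiver can report exactly which sequence numbers it has received (how IRN encodes this feedback on the wire is not shown here):

    # Selective retransmit vs go-back-N (illustrative comparison).
    def selective_retransmit(sent, received):
        """Retransmit only the packets the receiver is still missing."""
        return [seq for seq in sent if seq not in received]

    sent = [1, 2, 3, 4, 5]
    received = {1, 3, 4, 5}                       # packet 2 was lost
    print(selective_retransmit(sent, received))   # [2]
    print(list(range(2, 6)))                      # go-back-N would resend [2, 3, 4, 5]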


Benefit of Backing Off on Losses. Exploit losses as a congestion signal. Found to be particularly useful:
- When no advanced congestion control is being used: disabling this feature for IRN increased avg FCT by 158% for our default scenario.
- For scenarios where advanced congestion control does not react fast enough: disabling this feature for IRN+DCTCP increased avg FCT by 140% when the link bandwidth is 100 Gbps.
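
The back-off itself is the multiplicative decrease listed on the design-details slide (halve the rate or cwnd); a minimal window-mode sketch, with the floor and cap values chosen only for illustration:

    # Back off on losses: treat a detected loss as a congestion signal and
    # halve the congestion window, keeping it between one MTU and the BDP cap.
    def on_loss_detected(cwnd_bytes, bdp_bytes, mtu_bytes=1500):
        return max(mtu_bytes, min(cwnd_bytes // 2, bdp_bytes))

    print(on_loss_detected(cwnd_bytes=60_000, bdp_bytes=50_000))  # 30000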


Benefit of Capping by BDP. Provides an upper bound on the number of out-of-order packets. Improves performance irrespective of whether PFC is enabled or disabled: disabling this feature for IRN+DCQCN increased the avg FCT by 64% for our default scenario.
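
An illustrative calculation of the cap, using the 10 Gbps link rate from the setup slide and an assumed 24 us round-trip time (the exact RTT depends on the path length, which is not stated here):

    # BDP cap (illustrative): a flow may keep at most bandwidth x RTT bytes outstanding.
    link_rate_bps = 10e9   # 10 Gbps, from the setup slide
    rtt_s = 24e-6          # assumed round-trip time across the fat-tree

    bdp_bytes = link_rate_bps / 8 * rtt_s
    print(f"BDP cap = {bdp_bytes:.0f} bytes")     # 30000 bytes

    def can_send(outstanding_bytes, packet_bytes, cap=bdp_bytes):
        """Admit a new packet only if it keeps the outstanding bytes within the cap."""
        return outstanding_bytes + packet_bytes <= cap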

Design Details. IRN supports both a window mode and a rate mode. Start at line rate (or with cwnd = BDP). Losses are detected via:
- Three dupacks: selective retransmit + reduce the rate or cwnd by half.
- Partial ack: selective retransmit only.
- Fixed timeout: go-back-N + reduce the rate or cwnd by half.
Option for additive increase on new acks. For RDMA reads, the requester generates read acks.
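
Putting the rules on this slide together, a simplified window-mode sketch; the event names and bookkeeping are assumptions made for illustration, not the NIC's actual per-QP state machine:

    # Simplified IRN-style sender reaction to acknowledgment events (window mode).
    class IrnSender:
        def __init__(self, bdp_bytes, mtu_bytes=1500):
            self.bdp = bdp_bytes
            self.mtu = mtu_bytes
            self.cwnd = bdp_bytes            # start with cwnd = BDP (or at line rate)
            self.dupacks = 0

        def _halve(self):
            self.cwnd = max(self.mtu, min(self.cwnd // 2, self.bdp))

        def on_new_ack(self):                # cumulative ack advanced cleanly
            self.dupacks = 0
            self.cwnd = min(self.cwnd + self.mtu, self.bdp)   # optional additive increase

        def on_dup_ack(self):                # same cumulative ack seen again
            self.dupacks += 1
            if self.dupacks == 3:
                self._halve()
                return "selective_retransmit"
            return None

        def on_partial_ack(self):            # ack advances but holes remain
            return "selective_retransmit"    # retransmit the hole, no window reduction

        def on_timeout(self):                # fixed retransmission timeout fired
            self._halve()
            return "go_back_n"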

Implementation Feasibility. Support for out-of-order packet delivery at the receiver is a new feature in Mellanox ConnectX-5 NICs (added for adaptive routing), making selective retransmit straightforward to implement. The other IRN changes require only a few additional instructions. Increase in per-flow state: less than 3%.

Summary. IRN performs better than RoCE and does not need PFC. This holds across various congestion control algorithms and experimental scenarios. The NIC changes required are simple and feasible.

Thank you!