Revisiting Network Support for RDMA


1 Revisiting Network Support for RDMA Radhika Mittal, Alex Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, Scott Shenker UC Berkeley, ICSI, Mellanox, Univ. of Washington Abstract With the advent of RoCE (RDMA over Converged Ethernet), recent years have witnessed a significant increase in the usage of RDMA in Ethernet-based datacenter networks. RoCE NICs only achieve good performance when run over a lossless network, which is obtained by enabling Ethernet s Priority Flow Control (PFC) mechanism. However, PFC introduces significant problems, such as head-of-the-line blocking, congestion spreading, and occasional deadlocks. In this paper, we ask: is PFC fundamentally required for deploying RDMA over Ethernet, or is their use merely an artifact of the current RoCE NIC design? We find that while PFC is indeed needed for current RoCE NICs, it is unnecessary (and sometimes significantly harmful) when one updates RoCE NICs to a more appropriate (yet still feasible) design. We, thus, propose a new improved RoCE NIC (IRN) design, that does not require PFC and performs better than current RoCE NICs. Our results hold across different experimental scenarios and across different (and even without any) congestion control algorithms. Therefore, our results suggest that to avoid the many problems with PFC, we should adopt this new IRN design for running RDMA in datacenters. 1 Introduction Datacenter networks, when compared to more traditional wide-area networks, offer higher bandwidths and lower latencies. However, traditional endhost networking stacks, with their high latencies and substantial CPU overhead, have limited the extent to which applications can make use of these characteristics. As a result, several large datacenters have recently adopted RDMA which bypasses the traditional networking stacks in favor of direct memory accesses [24, 29, 37]. RDMA over Converged Ethernet (RoCE) has emerged as the canonical method for deploying RDMA in Ethernetbased datacenters [24, 37]. The centerpiece of RoCE is a NIC that (i) provides mechanisms for accessing host memory without CPU involvement and (ii) supports very basic network transport functionality. Early experience revealed that RoCE NICs only achieve good end-to-end performance when run over a lossless network, so operators turned to Ethernet s Priority Flow Control (PFC) mechanism to achieve minimal packet loss. The combination of RoCE and PFC has enabled a wave of datacenter RDMA deployments. However, the current solution is not without problems. In particular, PFC adds management complexity and can lead to significant performance problems such as head-ofthe-line blocking, congestion spreading, and occasional deadlocks [24, 25, 34, 36, 37]. Rather than proceed along the current path and address the various problems with PFC, in this paper we take a step back and ask whether it was needed in the first place. To be clear, current RoCE NICs require a lossless fabric for good performance. However, the question we raise is: can the RoCE NIC design be altered so that losslessness is not needed? We answer this question in the affirmative and propose a new design called IRN (for Improved RoCE NIC) that makes two key changes to current RoCE NICs (i) more efficient loss recovery, and (ii) basic end-to-end flow control that bounds the number of in-flight packets by the bandwidth-delay product of the network. 1 We show, via extensive simulations, that IRN does not require PFC, yet performs better than current RoCE NICs with PFC ( 3). 
While our design requires extensions to the RDMA packet format (4), we show that it is relatively easy to implement in hardware by synthesizing IRN's packet processing logic to target an FPGA device supported by Mellanox Innova Flex 4 NICs, and by comparing IRN's memory requirements to those of current RoCE NICs (5). Before describing our design, we first review some background.

1 These changes are orthogonal to the congestion control mechanisms that have been implemented on top of RoCE NICs (e.g., DCQCN [37] and Timely [29]).

2 Background

2.1 Evolution of RDMA

RDMA has long been used by the HPC community in special-purpose Infiniband clusters that use credit-based flow control to make the network lossless [5]. Because packet drops are rare in such clusters, the RDMA Infiniband transport as implemented on the NIC was not designed to efficiently recover from packet losses. When the receiver receives an out-of-order packet, it simply discards it and sends a negative acknowledgement (NACK) to the sender. When the sender receives a NACK or experiences a timeout, it retransmits all packets that were sent after the last acknowledged packet (i.e., it performs a go-back-n retransmission).

To take advantage of the widespread use of Ethernet in datacenters, RoCE [6, 10] was introduced to enable the use of RDMA over Ethernet. 2 RoCE adopted the same Infiniband transport design (including go-back-n loss recovery) and the network was made lossless using PFC. Early versions of RoCE were shipped without any in-built congestion control mechanism (and instead used a static rate for each flow). When it became clear that PFC had significant problems, congestion control mechanisms [29, 37] were proposed for RoCE that reduced the sending rate on detecting congestion, thus reducing the queue build-up and, consequently, the negative impact of PFC. In particular, DCQCN [37], an ECN-based congestion control scheme, was added as a feature on Mellanox RoCE NICs [33], and Timely [29] was proposed as an application-layer delay-based scheme which could be used with RoCE.

RoCE is not the only approach to deploying RDMA over Ethernet. iWARP [32] was designed to support RDMA over a generic (non-lossless) network. It implements the entire TCP stack in hardware, along with multiple other layers that it needs to translate TCP's byte stream semantics to RDMA segments. It is significantly more complex than Infiniband-based RDMA, and early iWARP implementations had suboptimal performance, making RoCE the current canonical choice for deploying RDMA over Ethernet. Our approach to improving RoCE is to adopt the architectural approach of iWARP (having the NIC handle basic loss recovery) but the implementation approach of RoCE (adding only a minimal set of additional features to current RoCE NICs rather than putting an entire set of protocols in hardware).

2.2 Priority Flow Control

Priority Flow Control (PFC) [7] is an Ethernet flow control mechanism: a switch sends a pause (or X-OFF) message to the upstream entity (a switch or a NIC) when the queue exceeds a certain configured threshold. When the queue drains below the threshold, an X-ON message is sent to resume transmission. When configured correctly, PFC can result in lossless networks (as long as all network elements remain functioning).

Over the last couple of years, numerous papers have been published highlighting the drawbacks of PFC [24, 25, 34, 36, 37], which range from mild (e.g., unfairness, head-of-line blocking, collateral damage) to severe, such as the pause spreading highlighted in [24] 3 and the deadlock problems in [25, 34, 36]. While none of these problems are definitively fatal, there is wide agreement that PFC makes networks harder to understand and manage, and can lead to myriad performance problems.

2 We use the term RoCE for both RoCE [6] and its successor RoCEv2 [10], which enables running RDMA not just over Ethernet but also over IP-routed networks.

3 Is a lossless network needed for good performance?
In this section we show that the need for PFC is merely an artifact of the current RoCE NIC design (in particular, its transport logic). We begin by describing the transport logic for an Improved RoCE NIC (IRN) in 3.1. For simplicity, we present it as a general design independent of the specific RDMA operation (be it Write, Read, Send or Atomic). We go into the details of handling specific RDMA operations with IRN later in 4. We then describe our experimental setup in 3.2, present our basic results in 3.3, and analyze them in more detail in 3.4.

3.1 IRN Transport Logic

As discussed in 2, current RoCE NICs use a go-back-n loss recovery scheme. As we will see in 3.3, redundant retransmissions caused by go-back-n loss recovery, in the absence of PFC, negatively impact performance. Therefore, the first change we make with IRN is to make the loss recovery more efficient, based on selective retransmission (inspired by TCP's loss recovery), where the receiver does not discard out-of-order packets 4 and the sender selectively retransmits the lost packets. The second change we make with IRN is introducing the notion of a basic end-to-end packet-level flow control, where we cap the number of outstanding packets in flight for a flow (or a QP) by the bandwidth-delay product (BDP) of the network, as suggested in [17]. This is a static cap that we compute by dividing the BDP (in bytes) by the packet MTU set by the QP (typically 1KB in RoCE NICs). We call this BDP-FC (BDP Flow Control), which improves performance by reducing unnecessary queuing in the network. Furthermore, it provides a strict upper bound on the number of out-of-order packets that can be seen by the receiver, which, as we will see later in 5, eases the state management in the NICs.

3 Examples include a single malfunctioning NIC that can pause the entire network and a particular reaction between PFC and Ethernet flooding that results in a network deadlock.

4 We discuss how the out-of-order packets are handled by the receiver in detail in 4.

We elaborate on the performance impact of these changes in 3.4. The IRN loss recovery and BDP-based flow control mechanisms are largely orthogonal to the specific congestion control algorithm (such as DCQCN [37] and Timely [29]) that can be used on top of them, as with current RoCE NICs. Moreover, IRN, like the RoCE transport, operates directly on the RDMA segments using packet semantics. We describe both IRN's flow control mechanism (BDP-FC) and loss recovery mechanism in greater detail below.

BDP-FC: An IRN sender transmits a new packet only if the number of packets in flight (computed as the difference between the current packet's sequence number and the last acknowledged sequence number) is less than the BDP cap. It requires the IRN receiver to send per-packet ACKs.

Loss Recovery: Upon every out-of-order packet arrival, an IRN receiver sends a NACK, which carries both the cumulative acknowledgment number (indicating its expected sequence number) and the sequence number of the packet that triggered the NACK (as a simplified form of selective acknowledgement). An IRN sender enters loss recovery mode when a NACK is received or when a timeout occurs. It maintains a bitmap to track which packets have been cumulatively and selectively acknowledged, which gets updated by subsequent NACKs that are received. When in loss recovery mode, rather than sending new packets when the Tx is free, the sender selectively retransmits lost packets as indicated by the bitmap. The first packet that is retransmitted on entering loss recovery corresponds to the cumulative acknowledgement value. Any subsequent packet is considered lost only if another packet with a higher sequence number has been selectively acked. The sender continues to transmit new packets (if allowed by BDP-FC) if there are no more lost packets to retransmit. It exits loss recovery when a cumulative acknowledgement greater than the recovery sequence is received, where the recovery sequence corresponds to the last regular packet that was sent before the retransmission of a lost packet.

NACKs with selective acknowledgement allow efficient loss recovery only when there are multiple packets in flight. For other cases (for example, single-packet messages), loss recovery gets triggered via timeouts. A high RTO timer value can increase the tail latency of these short single-packet messages. However, keeping the RTO timer value too small can result in too many spurious retransmissions, affecting the overall results. An IRN sender, therefore, uses a low timer value of RTO low only when there is a small number N of packets in flight, and a higher value of RTO high otherwise. 5 We discuss how the timeout feature in current RoCE NICs can be easily extended to support this in 5.2.

As mentioned before, IRN's loss recovery has been inspired by TCP's loss recovery. However, rather than incorporating the entire TCP stack as is done by iWARP NICs, IRN (1) decouples loss recovery from congestion control and does not incorporate any notion of TCP window control involving slow start, AIMD or advanced fast recovery, and (2) operates directly on RDMA segments instead of using a byte stream abstraction as in TCP, which allows IRN to simplify the selective acknowledgement scheme and the tracking of packet losses (as we will discuss in 5.2).

5 We also experimented with dynamically computed timeout values (as with TCP), which not only complicated the design, but did not help with this case, since these effects were then dominated by the initial timeout value.
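To make the transport logic above concrete, the following is a minimal sender-side sketch of BDP-FC and the SACK-bitmap loss recovery, written directly from the description in this section. It is an illustration, not the NIC's actual data path: the struct layout, the field names (snd_una, rtx_nxt, and so on), the plain in-memory bitmap and the demo scenario are our own assumptions.

```c
/* Minimal sketch of IRN's sender-side transport logic (illustrative only:
 * names, layout and the demo scenario in main() are our own assumptions). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BDP_CAP 110u   /* BDP in MTU-sized packets (default scenario of 3.2) */

struct irn_qp {
    uint32_t snd_nxt;         /* next new sequence number to send             */
    uint32_t snd_una;         /* cumulative ack: first unacknowledged packet  */
    uint32_t rtx_nxt;         /* next candidate for selective retransmission  */
    uint32_t recovery_seq;    /* last regular packet sent before a retransmit */
    uint32_t highest_sacked;  /* highest selectively acked sequence number    */
    bool     in_recovery;
    bool     sacked[BDP_CAP]; /* SACK bitmap, indexed by (seq - snd_una)      */
};

/* BDP-FC: a new packet may be sent only if fewer than BDP_CAP are in flight. */
static bool bdp_fc_allows_new(const struct irn_qp *qp)
{
    return qp->snd_nxt - qp->snd_una < BDP_CAP;
}

/* NACK carries the cumulative ack and the sequence that triggered the NACK. */
static void on_nack(struct irn_qp *qp, uint32_t cum_ack, uint32_t sack_seq)
{
    while (qp->snd_una < cum_ack) {            /* slide the bitmap forward    */
        memmove(qp->sacked, qp->sacked + 1, sizeof(qp->sacked) - sizeof(bool));
        qp->sacked[BDP_CAP - 1] = false;
        qp->snd_una++;
    }
    if (sack_seq >= qp->snd_una && sack_seq - qp->snd_una < BDP_CAP)
        qp->sacked[sack_seq - qp->snd_una] = true;
    if (sack_seq > qp->highest_sacked)
        qp->highest_sacked = sack_seq;

    if (!qp->in_recovery) {                    /* enter loss recovery          */
        qp->in_recovery  = true;
        qp->recovery_seq = qp->snd_nxt - 1;    /* last regular packet sent     */
        qp->rtx_nxt      = cum_ack;            /* first retransmission         */
    } else if (cum_ack > qp->recovery_seq) {   /* recovery sequence passed     */
        qp->in_recovery  = false;
    }
}

/* Called when the Tx is free: retransmit a lost packet if in recovery,
 * otherwise send a new packet if BDP-FC allows it. */
static bool next_to_transmit(struct irn_qp *qp, uint32_t *seq)
{
    if (qp->in_recovery) {
        if (qp->rtx_nxt < qp->snd_una)
            qp->rtx_nxt = qp->snd_una;
        /* A packet is lost only if a higher sequence has been SACKed. */
        while (qp->rtx_nxt < qp->highest_sacked &&
               qp->sacked[qp->rtx_nxt - qp->snd_una])
            qp->rtx_nxt++;
        if (qp->rtx_nxt < qp->highest_sacked) {
            *seq = qp->rtx_nxt++;              /* selective retransmission     */
            return true;
        }
    }
    if (bdp_fc_allows_new(qp)) {
        *seq = qp->snd_nxt++;                  /* new packet                   */
        return true;
    }
    return false;
}

int main(void)
{
    struct irn_qp qp = { .snd_nxt = 10 };      /* packets 0..9 are in flight   */
    on_nack(&qp, 3, 5);        /* 0-2 cumulatively acked, 5 SACKed: 3,4 lost   */

    uint32_t seq;
    for (int i = 0; i < 4 && next_to_transmit(&qp, &seq); i++)
        printf("transmit %u%s\n", (unsigned)seq,
               seq < 10 ? " (selective retransmission)" : " (new packet)");
    return 0;
}
```

Congestion control is deliberately absent from the sketch; as noted above, BDP-FC and loss recovery are orthogonal to whichever scheme (DCQCN, Timely, or none) runs on top of them.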
3.2 Experimental Settings

Simulator: Our simulator, obtained from a commercial NIC vendor, extends INET/OMNET++ [1, 2] to model the Mellanox ConnectX-4 RoCE NIC [11]. RDMA Queue Pairs (QPs) are modelled as UDP applications with either RoCE or IRN transport layer logic that generate flows (whose sizes are determined by the input workload as described later). We define a flow as a unit of data transfer comprising one or more messages between the same source-destination pair, as in [29, 37]. When the sender QP is ready to transmit data packets, it periodically polls the MAC layer until the link is available for transmission. The simulator implements the version of DCQCN implemented by the Mellanox ConnectX-4 RoCE NIC [33], and we add support for a NIC-based Timely implementation. When using DCQCN congestion control with IRN we include the ECN echo in the IRN ACK/NACK packets. For DCQCN with RoCE, where ACK/NACKs are not sent for every packet, we use a separate CNP packet as in [33, 37]. Finally, in our simulation, all switches are input-queued with virtual output ports that are scheduled using round-robin. The switches can be configured to generate PFC frames by setting appropriate buffer thresholds.

Default Case Scenario: For our default case, we simulate a 54-server three-tiered fat-tree topology, connected by a fabric with full bisection bandwidth constructed from 45 6-port switches organized into 6 pods [16]. We consider 40Gbps links with a propagation delay of 2µs, resulting in a bandwidth-delay product (BDP) of 120KB along the longest (6-hop) path. This corresponds to about 111 MTU-sized packets (assuming a typical RDMA MTU of 1KB and adding headers). Each end host generates new flows with Poisson interarrival times [17, 30]. Each flow's destination is picked randomly and its size is drawn from a realistic heavy-tailed distribution derived from [19]. Most flows are small (50% of the flows are single-packet messages with sizes ranging between 32 bytes and 1KB, representing small RPCs such as those generated by RDMA-based key-value stores [22, 26]), and most of the bytes are in large flows (15% of the flows are between 200KB and 3MB, representing background RDMA traffic such as storage).

Figure 1: The figures compare the PFC-enabled vs PFC-disabled performance for RoCE transport with and without explicit congestion control schemes along three metrics: average slowdown, average FCT, and tail FCT.

Figure 2: The figures compare the PFC-enabled vs PFC-disabled performance of IRN when used (a) without any explicit congestion control and (b) with explicit congestion control schemes, using three metrics (average slowdown, average FCT, and tail FCT). Note that the y-axis range is different from that in Figure 1.

The network load is set at 70% utilization for our default case. We use ECMP for load-balancing [24]. We vary different aspects of our default scenario (including topology size, workload pattern and link utilization) in 3.4.

Parameters: For IRN, we set RTO low to 100µs (with N = 3). RTO high is set to an estimate of the maximum round trip time with one congested link. We compute this as the sum of the propagation delay on the longest path and the maximum queuing delay a packet would see if the switch buffer on a congested link is completely full. This is approximately 320µs for our default case. When using RoCE without PFC, we use a fixed timeout of value RTO high. We disable timeouts when PFC is enabled to prevent spurious retransmissions. For our default case, we use buffers sized at twice the BDP of the network (which is 240KB in our default case) for each input port [17, 18]. The PFC threshold at the switches is set to the buffer size minus a headroom equal to the upstream link's bandwidth-delay product (needed to absorb all packets in flight along the link). This is 220KB for our default case. We vary these parameters in 3.4 to show that our results are not very sensitive to these specific choices. When using RoCE or IRN with Timely or DCQCN, we use the same congestion control parameters as specified in [29] and [37] respectively. For fair comparison with PFC-based proposals [36, 37], flows start at line rate in all cases (with and without congestion control).

Methodology: We run the simulation for a fixed duration (100ms by default). We then consider all flows that started within a specified time interval in this duration (12.5ms to 50ms for our default case, resulting in about 25 thousand flows) and compute the following three metrics: (i) average slowdown, where the slowdown for a flow is its completion time divided by the time it would have taken to traverse its path at line rate in an empty network, (ii) average flow completion time (FCT), and (iii) the tail flow completion time. While the average and tail FCTs are dominated by the performance of throughput-sensitive flows, the average slowdown is dominated by the performance of latency-sensitive short flows.

3.3 Basic Results

We now confront the central question of this paper: Does RDMA require a lossless network? If the answer is yes, then we must address the many difficulties of PFC. If the answer is no, then we can leave PFC behind and utilize common network infrastructure without further complications.
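As a brief aside, the default-scenario parameters quoted in 3.2 can be sanity-checked with simple arithmetic. The constants in the sketch below are taken from the text; the formulas are our own back-of-the-envelope reconstruction (the paper's cap of about 111 packets additionally accounts for per-packet headers, and RTO high depends on simulator-specific queuing details we do not attempt to reproduce).

```c
/* Back-of-the-envelope check of the default-scenario parameters in 3.2.
 * Constants come from the text; the arithmetic is our own reconstruction. */
#include <stdio.h>

int main(void)
{
    const double link_bps  = 40e9;   /* 40Gbps links                          */
    const double prop_s    = 2e-6;   /* per-hop propagation delay (2us)       */
    const int    hops      = 6;      /* longest path in the fat-tree          */
    const double mtu_bytes = 1024;   /* typical RoCE MTU of 1KB               */

    /* Two-way propagation delay on the longest path and the resulting BDP.   */
    double rtt_s     = 2.0 * hops * prop_s;            /* 24us                */
    double bdp_bytes = link_bps * rtt_s / 8.0;         /* 120KB               */
    double bdp_cap   = bdp_bytes / mtu_bytes;          /* ~117 bare-MTU packets;
                                                          ~111 once headers are added */

    /* Buffers are sized at twice the BDP; the PFC threshold leaves a headroom
     * of one upstream-link BDP (one-hop round trip) to absorb in-flight data. */
    double buffer_bytes   = 2.0 * bdp_bytes;                   /* 240KB       */
    double headroom_bytes = link_bps * (2.0 * prop_s) / 8.0;   /* 20KB        */
    double pfc_threshold  = buffer_bytes - headroom_bytes;     /* 220KB       */

    printf("RTT            : %.0f us\n", rtt_s * 1e6);
    printf("BDP            : %.0f KB (~%.0f MTU packets)\n", bdp_bytes / 1e3, bdp_cap);
    printf("Buffer per port: %.0f KB\n", buffer_bytes / 1e3);
    printf("PFC threshold  : %.0f KB\n", pfc_threshold / 1e3);
    return 0;
}
```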
We begin with some basic results for our default scenario.

Performance with Current RoCE NICs

Figure 1 shows the performance of current RoCE NICs with and without PFC (and with various forms of congestion control). The first set of bars in each graph uses the current RoCE design as described in 2 with no explicit congestion control. The use of PFC helps considerably. Removing PFC increases the average slowdown, the average FCT and the tail FCT by more than 2.5×, 3×, and 1.5× respectively. The last two sets of bars in each graph use RoCE with Timely and RoCE with DCQCN. PFC continues to be important, with performance improvements ranging from 1.35× to 3.5× across the three metrics. Thus, it is clear that PFC provides significant benefits for current RoCE NICs. This is largely because of the go-back-n loss recovery used by current RoCE NICs, which penalizes performance due to (1) increased congestion created by redundant retransmissions and (2) the time (and bandwidth) wasted by flows in sending redundant packets when recovering from losses.
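A toy calculation makes the cost of go-back-n concrete (the window size and loss pattern below are made-up illustrative values, not measurements from the paper): with a BDP's worth of packets in flight and a few drops, go-back-n puts roughly a full window back on the wire, whereas selective retransmission resends only what was actually lost.

```c
/* Illustrative count of retransmitted packets after a few drops, under
 * go-back-n vs selective retransmission (numbers are made up for clarity). */
#include <stdio.h>

#define WINDOW 110   /* packets in flight, e.g. one BDP cap */

int main(void)
{
    /* Suppose the packets at these 0-based positions in the window are lost. */
    const int lost[] = { 5, 40, 41, 90 };
    const int n_lost = (int)(sizeof(lost) / sizeof(lost[0]));

    /* Go-back-n: everything from the first loss onward is resent, even though
     * all but n_lost of those packets already reached the receiver.          */
    int gbn_retx = WINDOW - lost[0];

    /* Selective retransmission: only the lost packets are resent.            */
    int sel_retx = n_lost;

    printf("go-back-n : %d packets resent (%d of them redundant)\n",
           gbn_retx, gbn_retx - n_lost);
    printf("selective : %d packets resent (0 redundant)\n", sel_retx);
    return 0;
}
```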

Figure 3: The figures compare the PFC-enabled RoCE performance vs the PFC-disabled IRN performance with and without different congestion control schemes, using three metrics (average slowdown, average FCT, and tail FCT).

Performance with IRN

We now evaluate whether or not the need for PFC is a fundamental requirement for good performance by comparing IRN's performance with and without PFC. In Figure 2(a) we show the results of this comparison when no explicit congestion control is used. Remarkably, we find that not only is PFC not required, but it significantly degrades IRN's performance (increasing the value of each metric). This is because of the head-of-the-line blocking and congestion spreading issues PFC is notorious for: pauses triggered by congestion at one link cause queue build-up and pauses at other upstream entities, creating a cascading effect. Note that packet drops also negatively impact performance, since it takes about one round trip time to detect a packet loss and another round trip time to recover from it. However, the negative impact of a packet drop (given efficient loss recovery) is restricted to the flow that faces congestion and does not spread to other flows, as it does in the case of PFC. While these PFC issues have been observed before [24, 29, 37], we believe our work is the first to show that, given efficient loss recovery, PFC can have a greater negative impact on performance than packet drops.

The previous comparison showed that PFC is not necessary when using IRN without any explicit congestion control mechanism. However, as mentioned before, RoCE today is typically deployed in conjunction with some explicit congestion control mechanism such as Timely and DCQCN. Figure 2(b) shows the performance with and without PFC when we use these congestion control mechanisms on top of our IRN transport base. 6 We find that IRN's performance is largely independent of whether PFC is enabled or disabled: the largest negative impact of disabling PFC was less than 1%, and the largest improvement due to disabling PFC was about 3.5%. This is because, with explicit congestion control, both the PFC rate reduces (reducing the impact of congestion spreading) and the number of packet drops reduces, thus reducing the difference between PFC-enabled and PFC-disabled performance. We report the corresponding drop rates and PFC rates in A.1 of the attached appendix.

6 Note that while the average and tail FCT were slightly smaller when IRN was used without any congestion control algorithm, the average slowdown is significantly lower when IRN is used with Timely and DCQCN. This is because both Timely and DCQCN aim to achieve smaller queues (and therefore, smaller packet latencies), while compromising a little on the throughput.

Figure 4: The figure compares the average FCT for IRN with go-back-n loss recovery, IRN with the default selective retransmit based loss recovery, and IRN with PFC, across different congestion control algorithms. The y-axis is capped at 0.003s to better highlight the trends.

Current RoCE NICs vs IRN

The previous results compared IRN's performance with and without PFC, showing that it does not require losslessness. Figure 3 compares the performance of IRN without PFC vs current RoCE NICs with PFC.
We find that IRN without PFC performs better than RoCE + PFC across all metrics and congestion control schemes. This is because IRN's BDP-FC mechanism reduces unnecessary queuing, and disabling PFC eliminates congestion spreading issues.

Key Takeaways: The following are the three takeaways from these results: (1) current RoCE NICs require PFC to achieve good performance, (2) IRN does not require PFC, and (3) IRN performs better than current RoCE NICs. We analyze these results in more detail next.

3.4 Detailed Analysis

We now explore these basic results in more detail, by performing a factor analysis of IRN and evaluating the robustness of our trends across different experimental scenarios.

IRN Factor Analysis: As discussed in 3.1, IRN makes two key changes to the RoCE NIC: (1) efficient loss recovery and (2) BDP-FC. We now analyze the significance of these two changes.

Need for Efficient Loss Recovery: To study the significance of efficient loss recovery, we enabled go-back-n loss recovery for IRN and compared its performance with our default selective retransmit based IRN and with IRN + PFC (the latter two data points being the same as those evaluated in 3.3). Figure 4 shows the average FCT for the three cases. We find that IRN with go-back-n loss recovery performs significantly worse than both IRN with the default selective retransmit loss recovery and IRN with PFC.

We saw similar trends for average slowdown and tail FCT. This is because of the bandwidth wasted by go-back-n due to redundant retransmissions.

Before converging on the current IRN loss recovery mechanism (based on simplified SACKs), we also experimented with other, simpler loss recovery mechanisms. In particular, we explored the following two questions: (1) Can go-back-n be made more efficient? We found that explicitly backing off on losses (by reducing the rate by half whenever a packet loss is detected) reduced the negative impact of go-back-n for Timely (though not for DCQCN). Nonetheless, selective retransmit based loss recovery continued to perform significantly better across different scenarios (with the difference in average FCT for Timely ranging from 20%-50%). (2) Do we need SACKs? We also tried a selective retransmit scheme without SACKs (where the sender does not maintain a bitmap to track selective acknowledgements). This performed better than go-back-n. However, it fared poorly when there were multiple losses in a window, requiring multiple round trips to recover from them. The corresponding degradation in average FCT ranged from <1% to as high as 75% across different scenarios when compared to IRN's SACK-based loss recovery.

Figure 5: The figure compares the average FCT for the PFC-enabled and PFC-disabled performance of IRN without BDP-FC, across different congestion control algorithms.

Significance of BDP-FC: We next disabled BDP-FC for IRN (for both the PFC-enabled and PFC-disabled cases) 7 and compared its performance with BDP-FC enabled IRN (as evaluated in 3.3). Figure 5 reports the resulting average FCTs. We find that BDP-FC significantly improves performance by reducing unnecessary queuing (along with packet drops or PFC frames). Further, for the PFC-disabled case, BDP-FC prevents a flow that is recovering from a packet loss from sending additional new packets and increasing congestion, until the acknowledgement number has been advanced after loss recovery. Notice that, for our default case, IRN's performance without PFC continues to be better than with PFC when BDP-FC is disabled. This is because the congestion spreading issue is further intensified without BDP-FC. However, we will later discuss a specific scenario (incast) where BDP-FC plays a crucial role in getting good performance without PFC.

7 IRN with PFC and no BDP-FC is essentially the same as RoCE + PFC with per-packet ACKs.

Robustness across varying scenarios

We now discuss the robustness of IRN as the experimental scenario is varied from our default case. We experimented with (1) link utilization levels varied between 30%-90%, (2) link bandwidths varied from the default of 40Gbps to 10Gbps and 100Gbps, (3) larger fat-tree topologies with 128 and 250 servers, (4) a different workload with flow sizes uniformly distributed between 500KB and 5MB, representing background and storage traffic for RDMA, (5) the per-port buffer size varied between 60KB-480KB, and (6) varying other IRN parameters (increasing the RTO high value by up to 4 times the default of 320µs, and increasing the N value for using RTO low to 10 and 15). We present a brief summary of our results here and provide the detailed results for each of these scenarios in A.2 of the attached appendix.

Overall Results: Our results across these different scenarios are summarized below.
(a) When IRN was used without any congestion control, disabling PFC always improved performance, with the maximum improvement in performance across different scenarios being as high as 71%. (b) When IRN was used with Timely and DCQCN, disabling PFC often resulted in better performance (with the maximum improvement in performance being 28% for Timely and 17% for DCQCN). Any degradation in performance due to disabling PFC with IRN stayed within 1.6% for Timely and 5.1% for DCQCN. (c) IRN without PFC always performed better than RoCE with PFC, with the performance improvement ranging from 6% to 83% across different cases.

Some observed trends: We now list some interesting trends we observed across these scenarios. (a) The benefits of disabling PFC with IRN generally increase with increasing link utilization, as the negative impact of congestion spreading with PFC increases. (b) The benefits of disabling PFC with IRN decrease with increasing bandwidths, since the relative cost of a round trip time spent on detecting and reacting to packet drops also increases. (c) The benefits of disabling PFC with IRN increase with decreasing buffer sizes because of the higher number of PFC frames and the greater impact of congestion spreading. In general, as the buffer size is increased, the difference between PFC-enabled and PFC-disabled performance with IRN reduces (due to fewer PFC frames and packet drops), while the benefit of using IRN over RoCE + PFC increases (because of the greater relative reduction in queuing delay with IRN due to BDP-FC). 8

We also implemented conventional window-based congestion control schemes such as TCP's AIMD and DCTCP with IRN and observed similar trends.

8 We expect to see similar behaviour in shared buffer switches.

Figure 6: The figures compare the tail latency for single-packet messages for IRN (with and without PFC) and current RoCE NICs with PFC, across different congestion control algorithms: (a) no CC, (b) Timely, (c) DCQCN.

Figure 7: The figure shows the ratio of the request completion time of incast with IRN without PFC over IRN with PFC, for varying degrees of fan-in, across congestion control algorithms.

In fact, when IRN is used with TCP's AIMD logic, the benefits of disabling PFC were even stronger. This is because TCP exploits packet drops as a congestion signal (which is lost when PFC is enabled). We further observed that while the performance penalty for doing go-back-n was smaller for window-based schemes, IRN's selective retransmit based loss recovery scheme continued to perform significantly better across different scenarios. More details on this can be found in A.3 of the attached appendix.

Tail latency for small messages

We now look at the tail latency (or tail FCT) of the single-packet messages (with sizes 32 bytes to 1KB) from our default scenario, which is another metric that often matters in datacenters [29]. Figure 6 shows the tail CDF of latency (from 90%ile to 99.9%ile) for these messages, across different congestion control algorithms. Our key trends from 3.3 hold even for this metric. This is because IRN without PFC is able to recover from single-packet message losses quickly due to the low RTO low timeout value. With IRN + PFC, these messages end up waiting in the queues for a similar (or greater) duration due to pauses and congestion spreading. For all cases, IRN performs significantly better than RoCE + PFC (the latter seeing a larger negative impact due to congestion spreading).

Incast

We now consider different incast scenarios, both with and without cross-traffic. The incast workload without any cross traffic can be identified as the best case for PFC, since only valid congestion-causing flows are paused, without unnecessary head-of-the-line blocking.

Incast without cross-traffic: We simulate the incast workload on our default topology by striping 150MB of data across M randomly chosen sender nodes that send it to a fixed destination node [17]. We vary M from 10 to 50. We consider the request completion time (RCT) as the metric for incast performance, which is the time when the last flow completes. For each M, we repeat the experiment a hundred times and report the average RCT from these runs. Figure 7 shows the results, comparing IRN with and without PFC. Here we see the reverse trend from our previous results, where IRN without any congestion control was most adversely affected by disabling PFC. Nonetheless, any increase in the RCT due to disabling PFC remained less than 2.5%. The corresponding results comparing IRN with RoCE + PFC looked very similar to this, with the ratio of the RCT of IRN without PFC over RoCE + PFC staying between 0.95 and 1.024 across different congestion control schemes. In other words, the degradation due to using IRN without PFC (if any) stayed within 2.4%. We present these results in A.3 of the appendix.

Robustness: We observed that disabling BDP-FC with IRN for this experiment resulted in more than 10× higher RCTs.
Nonetheless, our results were fairly robust to overestimation of BDP: using a window of (3 BDP) for BDP-FC resulted in at most 6% increase in RCT without PFC for all (except one) 9 data points. We saw similar trends when the buffer size was reduced to 60KB. Reducing the link bandwidth to 10Gbps resulted in at most 3.5% increase in RCT due to disabling PFC. Increasing the bandwidth to 100Gbps resulted in at most 9% increase in RCT without congestion control and at most 4.6% with congestion control due to disabling PFC. We also simulated another incast scenario with 16 senders sending 1MB of data per connection to a single destination (all attached to the same ToR), and varied the number of connections per sender from 4 to 10 as in [29]. This resulted in at most 7% increase in RCT without PFC. Incast with cross traffic: The previous evaluation was the best-case scenario for PFC, where there was no cross traffic, so there was no negative performance impact of congestion spreading. In practice we expect incast to occur with other cross traffic in the network [24, 29]. We 9 The exception was when IRN was used without any congestion control for a 10 to 1 incast, where the increase in RCT without PFC was 15%. 7

8 started an incast as described above with M = 30, along with our default case workload running at 50% link utilization levels. Disabling PFC with IRN lowered the incast RCT by 40% without congestion and by 7% with Timely, and increased it by only 1.15% with DCQCN. The incast RCT for IRN without PFC was lower than RoCE + PFC by 4%-30% across the three congestion control schemes. For the background workload, PFC-disabled IRN s performance was better than both IRN + PFC (by 1%-75%) and RoCE + PFC (by 32%-87%) across the three congestion control schemes and the three metrics (i.e., the average slowdown, the average and the tail ). Summary: Our key results i.e., (1) IRN does not require PFC, and (2) IRN without PFC performs better than current RoCE NICs with PFC, hold across varying realistic scenarios, congestion control schemes and performance metrics. 4 IRN Implementation Considerations We now discuss how one can implement IRN s transport logic in an RDMA NIC, while maintaining the correctness of RDMA semantics as defined by the Infiniband RDMA specification [5]. Our implementation relies on extensions to RDMA s packet format, e.g., introducing new fields and packet types. These extensions are encapsulated within IP and UDP headers (as in RoCEv2) so they only effect the endhost behavior and not the network behavior (i.e. no changes are required at the switches). We provide a glossary for various RDMA-specific terms in B of the appendix. 4.1 Supporting RDMA Reads and Atomics IRN relies on per-packet ACKs for BDP-FC and loss recovery. Current RoCE NICs already support per-packet ACKs for RDMA Writes and Sends. However, when using RDMA Reads, the requester (which is the data sink) does not explicitly acknowledge response packets. IRN, therefore, introduces packets for read (N)ACKs that are sent by a requester for each Read response packet. RoCE currently has eight unused opcode values available for the reliable connected (RC) QPs, and we use one of these for read (N)ACKs. IRN also requires the responder to implement timeouts for read responses. 10. RDMA Atomic operations are treated similar to single-packet RDMA Read message. Our simulations from 3.1 did not use ACKs for the RoCE + PFC baseline 11, modelling the extreme case of all Reads. Therefore our results from 3 take into account the additional overhead of per-packet ACKs in IRN. 10 New timer-driven actions (such as, for implementing DCQCN) have been added to the NIC hardware implementation in the past [33] 11 We enabled per-packet ACKs for Timely which were required to compute RTTs. 4.2 Supporting Out-of-order Packet Delivery One of the key challenge for implementing IRN is supporting out-of-order (OOO) packet delivery at the receiver current RoCE NICs simply discard OOO packets. A naive approach for handling OOO packet would be to store all of them in NIC memory. The total number of OOO packets with IRN is bounded by the BDP cap (which is about 110 MTU-sized packets for our default scenario as described in 3.2) 12 Therefore to support a thousand flows, a NIC would need to buffer 110MB of packets, which exceeds the memory capacity on most commodity RDMA NICs. We therefore explore an alternate implementation strategy, where the NIC DMAs OOO packets directly to the final address in the application memory and keeps track of them using bitmaps (which are sized at BDP cap). This reduces NIC memory requirements from 1KB per OOO packet to only a couple of bits. However, it introduces some additional challenges. 
While we discuss them in greater detail below, at a high-level they are as follows: (1) For many RDMA operations, critical information is carried in the first packet (which is required to process other packets in the message) and/or in the last packet (which is required to complete message processing). Enabling OOO delivery therefore requires that (a) some of the information in the first packet be carried by all packets, and (b) the relevant information in the last packet be stored at the endpoint (either on NIC or main memory) for future use. (2) Some operations require packets to be matched with the corresponding WQE, which is done implicitly for inorder packet arrivals. But this implicit matching breaks with OOO packet arrival, creating a need for an explicit WQE sequence number to identify the corresponding WQE for each packet. Note that partial support for out-of-order packet delivery was introduced in the Mellanox ConnectX-5 NICs to enable adaptive routing [12]. However, it is restricted to RDMA Write and Read operations. We improve and extend this design to support all RDMA operation types in IRN Placing out-of-order requests packets: We now describe the challenges associated with placing out-of-order request packets and how IRN resolves them. RDMA Write: In current RoCE NICs only the first packet of an RDMA Write message carries the RETH header, which contains the remote memory location. If this is lost, the receiver would not know where to place subsequent data. IRN, therefore, requires putting the 12 For QPs that only send single packet messages less than one MTU in size, the number of outstanding packets is limited to the maximum number of outstanding requests, which is typically much smaller than the BDP cap [26, 27]. 8

9 RETH header extension in every packet. Send/Receive: InfiniBand specification requires that Receive WQEs (which contain the destination address for Send operations) be consumed in the order they are posted. With IRN every receive WQE and Send/Write with Immediate request WQE maintains a recv WQE SN, which is carried by all Send packets and by the last Write with Immediate packet. This can be used by an OOO Send packet to identify the correct receive WQE. IRN also requires the Send packets to carry the relative offset in the packet sequence number, which is used to identify the precise address when placing data. RDMA Read/Atomic: RDMA application semantics require the responder to not begin processing a Read- /Atomic request R, until all packets expected to arrive before R have been received. Therefore, an OOO Read- /Atomic request packet needs to be placed in a Read WQE buffer (which is already maintained by current RoCE NICs). With IRN, every Read/Atomic WQE maintains a read WQE SN, that is carried by all Read/Atomic request packets and allows identification of the correct index in the Read WQE buffer Signalling Completion The responder, in current RoCE NICs, maintains a message sequence number (MSN) which gets incremented when the last packet of a Write/Send message is received or when a Read/Atomic request is received. This MSN value is sent back to the requester in the ACK packets and is used to expire the corresponding request WQEs. The responder expires its receive WQE when the last packet of a Send or a Write With Immediate message is received and generates a CQE that contains certain meta-data about the transfer carried in the last packet. IRN s responder deals with OOO arrival for the last packet by maintaining a 2-bitmap, which in addition to tracking whether or not a packet p has arrived, also tracks whether (1) it will trigger an MSN update and (2) it will trigger a receive WQE expiration. For the latter case, the recv WQE SN carried by p (as discussed above) can identify the receive WQE with which the meta-data in p needs to be associated, thus enabling a premature CQE creation. The premature CQE can be stored in the main memory, until it gets delivered to the application after all packets upto p have been received. A more detailed description for this has been provided in C.1 of the appendix Application-level Assumptions Certain applications (for example FaRM [22]) rely on polling the last packet of a Write message to detect completion, which is incompatible with OOO data placement. This polling based approach violates the RDMA specification (Sec o9-20 [5]) and is more expensive than officially supported methods (FaRM [22] mentions moving on to using Write with Immediate in the future for better scalability). IRN s design provides all of the Write completion guarantees as per the RDMA specification. C.2 of the appendix discusses this in more details. OOO data placement can also result in a situation where data written to a particular memory location is overwritten by a restransmitted packet from an older message. Typically, applications using distributed memory frameworks assume relaxed memory ordering and use application layer fences whenever strong memory consistency is required [14, 35]. Therefore, both iwarp and Mellanox ConnectX-5, in supporting OOO data placement, expect the application to deal with the potential memory overwriting issue and do not handle it in the NIC or the driver. IRN can adopt the same strategy. 
Another alternative is to deal with this issue in the driver, by enabling the fence indicator for a newly posted request that could potentially overwrite an older one. 4.3 Other Considerations Currently, the same packet sequence number (PSN) counter is used to track request and response packets. This can result in sudden jumps in PSN, interfering with loss tracking and BDP-FC. IRN, therefore, splits the PSN counter into two different ones (1) spsn to track the request packets, and (2) rpsn to track the response packets. This decoupling remains transparent to the application and does not require any packet format changes. A detailed example for this is provided in C.3 of the appendix. IRN can also support shared receive queues and send with invalidate operations and is compatible with use of end-to-end credit (discussed in more details in C.4-C.7 of the attached appendix). 5 Implementation Feasibility We evaluate the feasibility of implementing IRN by breaking up the implementation overheads into three categories: (1) in 5.1, we do a comparative analysis of IRN s memory requirements; (2) in 5.2, we evaluate the overhead for implementing IRN s packet processing logic by synthesizing it on an FPGA; and (3) in 5.3, we evaluate via simulations how IRN s implementation-specific choices impact the end-to-end performance. 5.1 NIC State overhead due to IRN IRN s implementation requires some additional state on the NIC, that we summarize below. Additional Per-QP Context: State variables: IRN needs a total of 52 bits of additional state for its transport logic: 24 bits each for retransmit sequence and recovery sequence, and 4 bits for various flags. Other per-flow state variables needed for implementing IRN s transport logic (such as the expected sequence number, the next sequence to send and so on) are already maintained by current RoCE NICs. This accounts for 9

10 104 bits per QP (52 bits each at the requester and the responder). Maintaining a timer at the responder for Read timeouts and a variable to track incomplete Read request adds another 56 bits to the responder leading to a total of 160 bits (or 20 bytes) of additional per-qp state with IRN. For context, RoCE NICs currently maintain a few thousands of bits of per QP for various state variables [3]. Bitmaps: IRN requires five BDP sized bitmaps: two at the responder for the 2-bitmap to track request packets, one at the requester to track the Read responses, one each at the requester and responder for tracking selective acks. Assuming each bitmap to be 128 bits (i.e., sized at the BDP cap for a network with bandwidth 40Gbps and a twoway propagation delay of up to 24µs, typical in today s datacenter topologies [29]) 13, IRN would require a total of 640 bits (80 bytes) per QP for bitmaps. This is much less than the total size of bitmaps maintained by a QP for the OOO support in Mellanox ConnectX-5 NICs [3]. Others: Other per-qp meta-data that is needed by an IRN driver when a WQE is posted (e.g the counters used for assigning WQE SNs) or when it is expired (e.g. premature CQEs) can be stored directly in the main memory and do not add to the NIC memory overhead. Additional Per-WQE Context: As described in 4, IRN requires certain types of WQEs to maintain WQE SNs, which adds a total of 3 bytes to the per-wqe context (currently sized at 64 bytes [3]). Additional Shared State: IRN also maintains some additional variables (or parameters) that are shared across QPs. This includes the BDP cap value, the RTO low value, and N for RTO low, which adds up to a total of 10 bytes. Mellanox RoCE NICs support several MBs of cache to store various metadata including per-qp and per-wqe contexts [3]. This implies that the additional state added by IRN only consumes 3-10% of the NIC cache size for a couple of thousands of QPs and tens of thousands of WQEs (even when the size of the bitmap is increased to support 100Gbps links). 5.2 IRN s packet processing overhead We now evaluate the feasibility of implementing IRN s per-packet processing logic, which requires various bitmap manipulations and can be identified as the key functionality that differs from what is currently implemented in RoCE NICs. The basic provisions required for implementing other IRN changes (such as adding header extensions to packets, or premature CQE generation) are already implemented in RoCE NICs and can be easily extended for IRN. We use Xilinx Vivado Design Suite [4] to do a high-level synthesis of the four key packet processing modules (as described below), targeting Kintex Ultrascale 13 We pad the bitmap to a multiple of 32, for ease of implementation as described in 5.2 XCKU060 FPGA, which is supported as a bump-on-thewire on Mellanox Innova Flex 4 10/40Gbps NICs [13] Synthesis Process To focus on the additional packet processing complexity due to IRN, our implementation for the four modules is stripped-down. More specifically, each module receives the relevant packet metadata and the QP context as streamed inputs, relying on a RoCE NIC s existing implementation to parse the packet headers and retrieve the QP context from the NIC cache (or the system memory, in case of a cache miss). The updated QP context is passed as streamed output from the module, along with other relevant module-specific outputs as described below. 
(1) receivedata: Triggered on a packet arrival, it outputs the relevant information required to generate an ACK/- NACK packet and the number of premature CQEs that can be expired, along with the updated QP context (e.g. bitmaps, expected sequence number, MSN). (2) txfree: Triggered when the link s Tx is free for the QP to transmit, it outputs the sequence number of the packet to be (re-)transmitted and the updated QP context (e.g. next sequence to transmit). During loss-recovery, it also performs a look ahead by searching the SACK bitmap for the next packet sequence to be retransmitted. (3) receiveack: Triggered when an ACK/NACK packet arrives, it outputs the updated QP context (e.g. SACK bitmap, last acknowledged sequence). (4) timeout: If triggered when the timer expires using RTO low value (indicated by a flag in the QP context), it checks if the condition for using RTO low holds. If not, it does not take any action and sets an output flag to extend the timeout to RTO high. In other cases, it executes the timeout action and returns the updated QP context. Our implementation relies on existing RoCE NIC s support for setting timers, with the RTO low value being used by default, unless explicitly extended. The bitmap manipulations in the first three modules account for most of the complexity in our synthesis. Each bitmap was implemented as a ring buffer, using an arbitrary precision variable of 128 bits, with the head corresponding to the expected sequence number at the receiver (or the cumulative acknowledgement number at the sender). The key bitmap manipulations required by IRN can be reduced to the following three categories of known operations: (i) finding first zero, to find the next expected sequence number in receivedata and the next packet to retransmit in txfree (ii) popcount to compute the increment in MSN and the number of CQEs to be expired in receivedata, (iii) bit shifts to advance the bitmap heads in receivedata and receiveack. We optimized the first two operations by dividing the bitmap variables into chunks of 32 bits and operating on these chunks in parallel. We validated the correctness of our implementation by 10
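Although the synthesis above targets an FPGA, the three bitmap primitives it names reduce to a few lines of portable code. The sketch below assumes a 128-bit bitmap stored as four 32-bit chunks, mirroring the chunking described above; the function names, the per-bit shift loop and the example in main() are our own simplifications of what the hardware does on 32-bit chunks in parallel.

```c
/* Sketch of the three bitmap primitives described above, over a 128-bit
 * bitmap stored as four 32-bit chunks. Bit i corresponds to the packet at
 * offset i from the bitmap head; names and layout are illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define CHUNKS 4   /* 4 x 32 = 128 bits, i.e. one BDP cap */

/* (i) find first zero: the next expected sequence number (receivedata) or
 * the next packet to retransmit (txfree), as an offset from the head. */
static int find_first_zero(const uint32_t bm[CHUNKS])
{
    for (int c = 0; c < CHUNKS; c++)
        if (bm[c] != UINT32_MAX)
            for (int b = 0; b < 32; b++)
                if (!((bm[c] >> b) & 1u))
                    return c * 32 + b;
    return CHUNKS * 32;   /* bitmap completely full */
}

/* (ii) popcount over the first n bits: e.g. how many packets are recorded,
 * which drives the MSN increment and the number of CQEs to expire. */
static int popcount_prefix(const uint32_t bm[CHUNKS], int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        count += (bm[i / 32] >> (i % 32)) & 1u;
    return count;
}

/* (iii) bit shift: advance the bitmap head by n positions after the expected
 * sequence number (or the cumulative ack at the sender) moves forward. */
static void shift_head(uint32_t bm[CHUNKS], int n)
{
    for (int i = 0; i < n; i++) {
        for (int c = 0; c < CHUNKS - 1; c++)
            bm[c] = (bm[c] >> 1) | (bm[c + 1] << 31);
        bm[CHUNKS - 1] >>= 1;
    }
}

int main(void)
{
    uint32_t bm[CHUNKS] = { 0 };
    bm[0] = 0x17u;   /* offsets 0,1,2 and 4 received; offset 3 still missing */

    int run = find_first_zero(bm);
    printf("in-order run at the head      : %d packets\n", run);          /* 3 */
    printf("packets recorded in bitmap    : %d\n",
           popcount_prefix(bm, CHUNKS * 32));                             /* 4 */

    shift_head(bm, run);   /* head advances past the in-order run */
    printf("first missing offset after it : %d\n", find_first_zero(bm));  /* 0 */
    return 0;
}
```

In the synthesized design the find-first-zero and popcount are evaluated on the 32-bit chunks in parallel rather than bit by bit, but the functional behavior is the same.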


More information

Congestion Control End Hosts. CSE 561 Lecture 7, Spring David Wetherall. How fast should the sender transmit data?

Congestion Control End Hosts. CSE 561 Lecture 7, Spring David Wetherall. How fast should the sender transmit data? Congestion Control End Hosts CSE 51 Lecture 7, Spring. David Wetherall Today s question How fast should the sender transmit data? Not tooslow Not toofast Just right Should not be faster than the receiver

More information

Programming Assignment 3: Transmission Control Protocol

Programming Assignment 3: Transmission Control Protocol CS 640 Introduction to Computer Networks Spring 2005 http://www.cs.wisc.edu/ suman/courses/640/s05 Programming Assignment 3: Transmission Control Protocol Assigned: March 28,2005 Due: April 15, 2005, 11:59pm

More information

TCP so far Computer Networking Outline. How Was TCP Able to Evolve

TCP so far Computer Networking Outline. How Was TCP Able to Evolve TCP so far 15-441 15-441 Computer Networking 15-641 Lecture 14: TCP Performance & Future Peter Steenkiste Fall 2016 www.cs.cmu.edu/~prs/15-441-f16 Reliable byte stream protocol Connection establishments

More information

Chapter III. congestion situation in Highspeed Networks

Chapter III. congestion situation in Highspeed Networks Chapter III Proposed model for improving the congestion situation in Highspeed Networks TCP has been the most used transport protocol for the Internet for over two decades. The scale of the Internet and

More information

CS 421: COMPUTER NETWORKS SPRING FINAL May 24, minutes. Name: Student No: TOT

CS 421: COMPUTER NETWORKS SPRING FINAL May 24, minutes. Name: Student No: TOT CS 421: COMPUTER NETWORKS SPRING 2012 FINAL May 24, 2012 150 minutes Name: Student No: Show all your work very clearly. Partial credits will only be given if you carefully state your answer with a reasonable

More information

Lecture 15: TCP over wireless networks. Mythili Vutukuru CS 653 Spring 2014 March 13, Thursday

Lecture 15: TCP over wireless networks. Mythili Vutukuru CS 653 Spring 2014 March 13, Thursday Lecture 15: TCP over wireless networks Mythili Vutukuru CS 653 Spring 2014 March 13, Thursday TCP - recap Transport layer TCP is the dominant protocol TCP provides in-order reliable byte stream abstraction

More information

CS 5520/ECE 5590NA: Network Architecture I Spring Lecture 13: UDP and TCP

CS 5520/ECE 5590NA: Network Architecture I Spring Lecture 13: UDP and TCP CS 5520/ECE 5590NA: Network Architecture I Spring 2008 Lecture 13: UDP and TCP Most recent lectures discussed mechanisms to make better use of the IP address space, Internet control messages, and layering

More information

CS268: Beyond TCP Congestion Control

CS268: Beyond TCP Congestion Control TCP Problems CS68: Beyond TCP Congestion Control Ion Stoica February 9, 004 When TCP congestion control was originally designed in 1988: - Key applications: FTP, E-mail - Maximum link bandwidth: 10Mb/s

More information

A Two-level Threshold Recovery Mechanism for SCTP

A Two-level Threshold Recovery Mechanism for SCTP A Two-level Threshold Recovery Mechanism for SCTP Armando L. Caro Jr., Janardhan R. Iyengar, Paul D. Amer, Gerard J. Heinz Computer and Information Sciences University of Delaware acaro, iyengar, amer,

More information

Randomization. Randomization used in many protocols We ll study examples:

Randomization. Randomization used in many protocols We ll study examples: Randomization Randomization used in many protocols We ll study examples: Ethernet multiple access protocol Router (de)synchronization Switch scheduling 1 Ethernet Single shared broadcast channel 2+ simultaneous

More information

Randomization used in many protocols We ll study examples: Ethernet multiple access protocol Router (de)synchronization Switch scheduling

Randomization used in many protocols We ll study examples: Ethernet multiple access protocol Router (de)synchronization Switch scheduling Randomization Randomization used in many protocols We ll study examples: Ethernet multiple access protocol Router (de)synchronization Switch scheduling 1 Ethernet Single shared broadcast channel 2+ simultaneous

More information

Outline 9.2. TCP for 2.5G/3G wireless

Outline 9.2. TCP for 2.5G/3G wireless Transport layer 9.1 Outline Motivation, TCP-mechanisms Classical approaches (Indirect TCP, Snooping TCP, Mobile TCP) PEPs in general Additional optimizations (Fast retransmit/recovery, Transmission freezing,

More information

Communication Networks

Communication Networks Communication Networks Spring 2018 Laurent Vanbever nsg.ee.ethz.ch ETH Zürich (D-ITET) April 30 2018 Materials inspired from Scott Shenker & Jennifer Rexford Last week on Communication Networks We started

More information

CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF

CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF Your Name: Answers SUNet ID: root @stanford.edu In accordance with both the letter and the

More information

Chapter 24. Transport-Layer Protocols

Chapter 24. Transport-Layer Protocols Chapter 24. Transport-Layer Protocols 23.1 Introduction 23.2 User Datagram Protocol 23.3 Transmission Control Protocol 23.4 SCTP Computer Networks 24-1 Position of Transport-Layer Protocols UDP is an unreliable

More information

Cloud e Datacenter Networking

Cloud e Datacenter Networking Cloud e Datacenter Networking Università degli Studi di Napoli Federico II Dipartimento di Ingegneria Elettrica e delle Tecnologie dell Informazione DIETI Laurea Magistrale in Ingegneria Informatica Prof.

More information

Chapter 13 TRANSPORT. Mobile Computing Winter 2005 / Overview. TCP Overview. TCP slow-start. Motivation Simple analysis Various TCP mechanisms

Chapter 13 TRANSPORT. Mobile Computing Winter 2005 / Overview. TCP Overview. TCP slow-start. Motivation Simple analysis Various TCP mechanisms Overview Chapter 13 TRANSPORT Motivation Simple analysis Various TCP mechanisms Distributed Computing Group Mobile Computing Winter 2005 / 2006 Distributed Computing Group MOBILE COMPUTING R. Wattenhofer

More information

TCP Performance. EE 122: Intro to Communication Networks. Fall 2006 (MW 4-5:30 in Donner 155) Vern Paxson TAs: Dilip Antony Joseph and Sukun Kim

TCP Performance. EE 122: Intro to Communication Networks. Fall 2006 (MW 4-5:30 in Donner 155) Vern Paxson TAs: Dilip Antony Joseph and Sukun Kim TCP Performance EE 122: Intro to Communication Networks Fall 2006 (MW 4-5:30 in Donner 155) Vern Paxson TAs: Dilip Antony Joseph and Sukun Kim http://inst.eecs.berkeley.edu/~ee122/ Materials with thanks

More information

Advanced Computer Networks. End Host Optimization

Advanced Computer Networks. End Host Optimization Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct

More information

Sequence Number. Acknowledgment Number. Data

Sequence Number. Acknowledgment Number. Data CS 455 TCP, Page 1 Transport Layer, Part II Transmission Control Protocol These slides are created by Dr. Yih Huang of George Mason University. Students registered in Dr. Huang's courses at GMU can make

More information

IIP Wireless. Presentation Outline

IIP Wireless. Presentation Outline IIP Wireless Improving Internet Protocols for Wireless Links Markku Kojo Department of Computer Science www.cs cs.helsinki.fi/research/.fi/research/iwtcp/ 1 Presentation Outline Project Project Summary

More information

Lecture 16: Wireless Networks

Lecture 16: Wireless Networks &6( *UDGXDWH1HWZRUNLQJ :LQWHU Lecture 16: Wireless Networks Geoffrey M. Voelker :LUHOHVV1HWZRUNLQJ Many topics in wireless networking Transport optimizations, ad hoc routing, MAC algorithms, QoS, mobility,

More information

6.033 Spring 2015 Lecture #11: Transport Layer Congestion Control Hari Balakrishnan Scribed by Qian Long

6.033 Spring 2015 Lecture #11: Transport Layer Congestion Control Hari Balakrishnan Scribed by Qian Long 6.033 Spring 2015 Lecture #11: Transport Layer Congestion Control Hari Balakrishnan Scribed by Qian Long Please read Chapter 19 of the 6.02 book for background, especially on acknowledgments (ACKs), timers,

More information

Congestion control in TCP

Congestion control in TCP Congestion control in TCP If the transport entities on many machines send too many packets into the network too quickly, the network will become congested, with performance degraded as packets are delayed

More information

CSE 4215/5431: Mobile Communications Winter Suprakash Datta

CSE 4215/5431: Mobile Communications Winter Suprakash Datta CSE 4215/5431: Mobile Communications Winter 2013 Suprakash Datta datta@cse.yorku.ca Office: CSEB 3043 Phone: 416-736-2100 ext 77875 Course page: http://www.cse.yorku.ca/course/4215 Some slides are adapted

More information

ECSE-6600: Internet Protocols Spring 2007, Exam 1 SOLUTIONS

ECSE-6600: Internet Protocols Spring 2007, Exam 1 SOLUTIONS ECSE-6600: Internet Protocols Spring 2007, Exam 1 SOLUTIONS Time: 75 min (strictly enforced) Points: 50 YOUR NAME (1 pt): Be brief, but DO NOT omit necessary detail {Note: Simply copying text directly

More information

Episode 5. Scheduling and Traffic Management

Episode 5. Scheduling and Traffic Management Episode 5. Scheduling and Traffic Management Part 3 Baochun Li Department of Electrical and Computer Engineering University of Toronto Outline What is scheduling? Why do we need it? Requirements of a scheduling

More information

CS 640 Introduction to Computer Networks Spring 2009

CS 640 Introduction to Computer Networks Spring 2009 CS 640 Introduction to Computer Networks Spring 2009 http://pages.cs.wisc.edu/~suman/courses/wiki/doku.php?id=640-spring2009 Programming Assignment 3: Transmission Control Protocol Assigned: March 26,

More information

CS519: Computer Networks. Lecture 5, Part 4: Mar 29, 2004 Transport: TCP congestion control

CS519: Computer Networks. Lecture 5, Part 4: Mar 29, 2004 Transport: TCP congestion control : Computer Networks Lecture 5, Part 4: Mar 29, 2004 Transport: TCP congestion control TCP performance We ve seen how TCP the protocol works Sequencing, receive window, connection setup and teardown And

More information

Transport Protocols & TCP TCP

Transport Protocols & TCP TCP Transport Protocols & TCP CSE 3213 Fall 2007 13 November 2007 1 TCP Services Flow control Connection establishment and termination Congestion control 2 1 TCP Services Transmission Control Protocol (RFC

More information

Improving the Robustness of TCP to Non-Congestion Events

Improving the Robustness of TCP to Non-Congestion Events Improving the Robustness of TCP to Non-Congestion Events Presented by : Sally Floyd floyd@acm.org For the Authors: Sumitha Bhandarkar A. L. Narasimha Reddy {sumitha,reddy}@ee.tamu.edu Problem Statement

More information

EE 122: IP Forwarding and Transport Protocols

EE 122: IP Forwarding and Transport Protocols EE 1: IP Forwarding and Transport Protocols Ion Stoica (and Brighten Godfrey) TAs: Lucian Popa, David Zats and Ganesh Ananthanarayanan http://inst.eecs.berkeley.edu/~ee1/ (Materials with thanks to Vern

More information

CMPE 257: Wireless and Mobile Networking

CMPE 257: Wireless and Mobile Networking CMPE 257: Wireless and Mobile Networking Katia Obraczka Computer Engineering UCSC Baskin Engineering Lecture 10 CMPE 257 Spring'15 1 Student Presentations Schedule May 21: Sam and Anuj May 26: Larissa

More information

CSE 461. TCP and network congestion

CSE 461. TCP and network congestion CSE 461 TCP and network congestion This Lecture Focus How should senders pace themselves to avoid stressing the network? Topics Application Presentation Session Transport Network congestion collapse Data

More information

Recap. TCP connection setup/teardown Sliding window, flow control Retransmission timeouts Fairness, max-min fairness AIMD achieves max-min fairness

Recap. TCP connection setup/teardown Sliding window, flow control Retransmission timeouts Fairness, max-min fairness AIMD achieves max-min fairness Recap TCP connection setup/teardown Sliding window, flow control Retransmission timeouts Fairness, max-min fairness AIMD achieves max-min fairness 81 Feedback Signals Several possible signals, with different

More information

6.1 Internet Transport Layer Architecture 6.2 UDP (User Datagram Protocol) 6.3 TCP (Transmission Control Protocol) 6. Transport Layer 6-1

6.1 Internet Transport Layer Architecture 6.2 UDP (User Datagram Protocol) 6.3 TCP (Transmission Control Protocol) 6. Transport Layer 6-1 6. Transport Layer 6.1 Internet Transport Layer Architecture 6.2 UDP (User Datagram Protocol) 6.3 TCP (Transmission Control Protocol) 6. Transport Layer 6-1 6.1 Internet Transport Layer Architecture The

More information

RoCE vs. iwarp Competitive Analysis

RoCE vs. iwarp Competitive Analysis WHITE PAPER February 217 RoCE vs. iwarp Competitive Analysis Executive Summary...1 RoCE s Advantages over iwarp...1 Performance and Benchmark Examples...3 Best Performance for Virtualization...5 Summary...6

More information

Congestion Control in TCP

Congestion Control in TCP Congestion Control in TCP Antonio Carzaniga Faculty of Informatics University of Lugano May 6, 2005 Outline Intro to congestion control Input rate vs. output throughput Congestion window Congestion avoidance

More information

Networking for Data Acquisition Systems. Fabrice Le Goff - 14/02/ ISOTDAQ

Networking for Data Acquisition Systems. Fabrice Le Goff - 14/02/ ISOTDAQ Networking for Data Acquisition Systems Fabrice Le Goff - 14/02/2018 - ISOTDAQ Outline Generalities The OSI Model Ethernet and Local Area Networks IP and Routing TCP, UDP and Transport Efficiency Networking

More information

Reliable Transport I: Concepts and TCP Protocol

Reliable Transport I: Concepts and TCP Protocol Reliable Transport I: Concepts and TCP Protocol Stefano Vissicchio UCL Computer Science COMP0023 Today Transport Concepts Layering context Transport goals Transport mechanisms and design choices TCP Protocol

More information

image 3.8 KB Figure 1.6: Example Web Page

image 3.8 KB Figure 1.6: Example Web Page image. KB image 1 KB Figure 1.: Example Web Page and is buffered at a router, it must wait for all previously queued packets to be transmitted first. The longer the queue (i.e., the more packets in the

More information

Transport Protocols and TCP: Review

Transport Protocols and TCP: Review Transport Protocols and TCP: Review CSE 6590 Fall 2010 Department of Computer Science & Engineering York University 1 19 September 2010 1 Connection Establishment and Termination 2 2 1 Connection Establishment

More information

Congestion / Flow Control in TCP

Congestion / Flow Control in TCP Congestion and Flow Control in 1 Flow Control and Congestion Control Flow control Sender avoids overflow of receiver buffer Congestion control All senders avoid overflow of intermediate network buffers

More information

Lecture 14: Congestion Control"

Lecture 14: Congestion Control Lecture 14: Congestion Control" CSE 222A: Computer Communication Networks Alex C. Snoeren Thanks: Amin Vahdat, Dina Katabi Lecture 14 Overview" TCP congestion control review XCP Overview 2 Congestion Control

More information

II. Principles of Computer Communications Network and Transport Layer

II. Principles of Computer Communications Network and Transport Layer II. Principles of Computer Communications Network and Transport Layer A. Internet Protocol (IP) IPv4 Header An IP datagram consists of a header part and a text part. The header has a 20-byte fixed part

More information

Performance Consequences of Partial RED Deployment

Performance Consequences of Partial RED Deployment Performance Consequences of Partial RED Deployment Brian Bowers and Nathan C. Burnett CS740 - Advanced Networks University of Wisconsin - Madison ABSTRACT The Internet is slowly adopting routers utilizing

More information

ROBUST TCP: AN IMPROVEMENT ON TCP PROTOCOL

ROBUST TCP: AN IMPROVEMENT ON TCP PROTOCOL ROBUST TCP: AN IMPROVEMENT ON TCP PROTOCOL SEIFEDDINE KADRY 1, ISSA KAMAR 1, ALI KALAKECH 2, MOHAMAD SMAILI 1 1 Lebanese University - Faculty of Science, Lebanon 1 Lebanese University - Faculty of Business,

More information

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2014

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2014 1 Congestion Control In The Internet Part 2: How it is implemented in TCP JY Le Boudec 2014 Contents 1. Congestion control in TCP 2. The fairness of TCP 3. The loss throughput formula 4. Explicit Congestion

More information

Reliable Transport II: TCP and Congestion Control

Reliable Transport II: TCP and Congestion Control Reliable Transport II: TCP and Congestion Control Stefano Vissicchio UCL Computer Science COMP0023 Recap: Last Lecture Transport Concepts Layering context Transport goals Transport mechanisms and design

More information

Baidu s Best Practice with Low Latency Networks

Baidu s Best Practice with Low Latency Networks Baidu s Best Practice with Low Latency Networks Feng Gao IEEE 802 IC NEND Orlando, FL November 2017 Presented by Huawei Low Latency Network Solutions 01 1. Background Introduction 2. Network Latency Analysis

More information

Data Link Control Protocols

Data Link Control Protocols Protocols : Introduction to Data Communications Sirindhorn International Institute of Technology Thammasat University Prepared by Steven Gordon on 23 May 2012 Y12S1L07, Steve/Courses/2012/s1/its323/lectures/datalink.tex,

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction In a packet-switched network, packets are buffered when they cannot be processed or transmitted at the rate they arrive. There are three main reasons that a router, with generic

More information

CMSC 417. Computer Networks Prof. Ashok K Agrawala Ashok Agrawala. October 11, 2018

CMSC 417. Computer Networks Prof. Ashok K Agrawala Ashok Agrawala. October 11, 2018 CMSC 417 Computer Networks Prof. Ashok K Agrawala 2018 Ashok Agrawala Message, Segment, Packet, and Frame host host HTTP HTTP message HTTP TCP TCP segment TCP router router IP IP packet IP IP packet IP

More information

Announcements. IP Forwarding & Transport Protocols. Goals of Today s Lecture. Are 32-bit Addresses Enough? Summary of IP Addressing.

Announcements. IP Forwarding & Transport Protocols. Goals of Today s Lecture. Are 32-bit Addresses Enough? Summary of IP Addressing. IP Forwarding & Transport Protocols EE 122: Intro to Communication Networks Fall 2007 (WF 4-5:30 in Cory 277) Vern Paxson TAs: Lisa Fowler, Daniel Killebrew & Jorge Ortiz http://inst.eecs.berkeley.edu/~ee122/

More information

Overview. TCP & router queuing Computer Networking. TCP details. Workloads. TCP Performance. TCP Performance. Lecture 10 TCP & Routers

Overview. TCP & router queuing Computer Networking. TCP details. Workloads. TCP Performance. TCP Performance. Lecture 10 TCP & Routers Overview 15-441 Computer Networking TCP & router queuing Lecture 10 TCP & Routers TCP details Workloads Lecture 10: 09-30-2002 2 TCP Performance TCP Performance Can TCP saturate a link? Congestion control

More information

Congestion Control. Brighten Godfrey CS 538 January Based in part on slides by Ion Stoica

Congestion Control. Brighten Godfrey CS 538 January Based in part on slides by Ion Stoica Congestion Control Brighten Godfrey CS 538 January 31 2018 Based in part on slides by Ion Stoica Announcements A starting point: the sliding window protocol TCP flow control Make sure receiving end can

More information

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2015

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2015 1 Congestion Control In The Internet Part 2: How it is implemented in TCP JY Le Boudec 2015 Contents 1. Congestion control in TCP 2. The fairness of TCP 3. The loss throughput formula 4. Explicit Congestion

More information

Network-Adaptive Video Coding and Transmission

Network-Adaptive Video Coding and Transmission Header for SPIE use Network-Adaptive Video Coding and Transmission Kay Sripanidkulchai and Tsuhan Chen Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213

More information

Mobile Transport Layer

Mobile Transport Layer Mobile Transport Layer 1 Transport Layer HTTP (used by web services) typically uses TCP Reliable transport between TCP client and server required - Stream oriented, not transaction oriented - Network friendly:

More information

Chapter 3 outline. 3.5 Connection-oriented transport: TCP. 3.6 Principles of congestion control 3.7 TCP congestion control

Chapter 3 outline. 3.5 Connection-oriented transport: TCP. 3.6 Principles of congestion control 3.7 TCP congestion control Chapter 3 outline 3.1 Transport-layer services 3.2 Multiplexing and demultiplexing 3.3 Connectionless transport: UDP 3.4 Principles of reliable data transfer 3.5 Connection-oriented transport: TCP segment

More information

Congestion. Can t sustain input rate > output rate Issues: - Avoid congestion - Control congestion - Prioritize who gets limited resources

Congestion. Can t sustain input rate > output rate Issues: - Avoid congestion - Control congestion - Prioritize who gets limited resources Congestion Source 1 Source 2 10-Mbps Ethernet 100-Mbps FDDI Router 1.5-Mbps T1 link Destination Can t sustain input rate > output rate Issues: - Avoid congestion - Control congestion - Prioritize who gets

More information

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2014

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2014 1 Congestion Control In The Internet Part 2: How it is implemented in TCP JY Le Boudec 2014 Contents 1. Congestion control in TCP 2. The fairness of TCP 3. The loss throughput formula 4. Explicit Congestion

More information

Basic Reliable Transport Protocols

Basic Reliable Transport Protocols Basic Reliable Transport Protocols Do not be alarmed by the length of this guide. There are a lot of pictures. You ve seen in lecture that most of the networks we re dealing with are best-effort : they

More information

ETSF05/ETSF10 Internet Protocols Transport Layer Protocols

ETSF05/ETSF10 Internet Protocols Transport Layer Protocols ETSF05/ETSF10 Internet Protocols Transport Layer Protocols 2016 Jens Andersson Transport Layer Communication between applications Process-to-process delivery Client/server concept Local host Normally initialiser

More information

CS321: Computer Networks Congestion Control in TCP

CS321: Computer Networks Congestion Control in TCP CS321: Computer Networks Congestion Control in TCP Dr. Manas Khatua Assistant Professor Dept. of CSE IIT Jodhpur E-mail: manaskhatua@iitj.ac.in Causes and Cost of Congestion Scenario-1: Two Senders, a

More information

Mobile Communications Chapter 9: Mobile Transport Layer

Mobile Communications Chapter 9: Mobile Transport Layer Prof. Dr.-Ing Jochen H. Schiller Inst. of Computer Science Freie Universität Berlin Germany Mobile Communications Chapter 9: Mobile Transport Layer Motivation, TCP-mechanisms Classical approaches (Indirect

More information

Request for Comments: S. Floyd ICSI K. Ramakrishnan AT&T Labs Research June 2009

Request for Comments: S. Floyd ICSI K. Ramakrishnan AT&T Labs Research June 2009 Network Working Group Request for Comments: 5562 Category: Experimental A. Kuzmanovic A. Mondal Northwestern University S. Floyd ICSI K. Ramakrishnan AT&T Labs Research June 2009 Adding Explicit Congestion

More information

CMPE150 Midterm Solutions

CMPE150 Midterm Solutions CMPE150 Midterm Solutions Question 1 Packet switching and circuit switching: (a) Is the Internet a packet switching or circuit switching network? Justify your answer. The Internet is a packet switching

More information

Computer Networking. Queue Management and Quality of Service (QOS)

Computer Networking. Queue Management and Quality of Service (QOS) Computer Networking Queue Management and Quality of Service (QOS) Outline Previously:TCP flow control Congestion sources and collapse Congestion control basics - Routers 2 Internet Pipes? How should you

More information

ECE697AA Lecture 3. Today s lecture

ECE697AA Lecture 3. Today s lecture ECE697AA Lecture 3 Transport Layer: TCP and UDP Tilman Wolf Department of Electrical and Computer Engineering 09/09/08 Today s lecture Transport layer User datagram protocol (UDP) Reliable data transfer

More information

Question. Reliable Transport: The Prequel. Don t parse my words too carefully. Don t be intimidated. Decisions and Their Principles.

Question. Reliable Transport: The Prequel. Don t parse my words too carefully. Don t be intimidated. Decisions and Their Principles. Question How many people have not yet participated? Reliable Transport: The Prequel EE122 Fall 2012 Scott Shenker http://inst.eecs.berkeley.edu/~ee122/ Materials with thanks to Jennifer Rexford, Ion Stoica,

More information

DIBS: Just-in-time congestion mitigation for Data Centers

DIBS: Just-in-time congestion mitigation for Data Centers DIBS: Just-in-time congestion mitigation for Data Centers Kyriakos Zarifis, Rui Miao, Matt Calder, Ethan Katz-Bassett, Minlan Yu, Jitendra Padhye University of Southern California Microsoft Research Summary

More information