Revisiting Network Support for RDMA


1 Revisiting Network Support for RDMA Radhika Mittal, Alex Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, Scott Shenker UC Berkeley, ICSI, Mellanox, Univ. of Washington Abstract With the advent of RoCE (RDMA over Converged Ethernet), recent years have witnessed a significant increase in the usage of RDMA in Ethernet-based datacenter networks. RoCE NICs only achieve good performance when run over a lossless network, which is obtained by enabling Ethernet s Priority Flow Control (PFC) mechanism. However, PFC introduces significant problems, such as head-of-the-line blocking, congestion spreading, and occasional deadlocks. In this paper, we ask: is PFC fundamentally required for deploying RDMA over Ethernet, or is their use merely an artifact of the current RoCE NIC design? We find that while PFC is indeed needed for current RoCE NICs, it is unnecessary (and sometimes significantly harmful) when one updates RoCE NICs to a more appropriate (yet still feasible) design. We, thus, propose a new improved RoCE NIC (IRN) design, that does not require PFC and performs better than current RoCE NICs. Our results hold across different experimental scenarios and across different (and even without any) congestion control algorithms. Therefore, our results suggest that to avoid the many problems with PFC, we should adopt this new IRN design for running RDMA in datacenters. 1 Introduction Datacenter networks, when compared to more traditional wide-area networks, offer higher bandwidths and lower latencies. However, traditional endhost networking stacks, with their high latencies and substantial CPU overhead, have limited the extent to which applications can make use of these characteristics. As a result, several large datacenters have recently adopted RDMA which bypasses the traditional networking stacks in favor of direct memory accesses [24, 29, 37]. RDMA over Converged Ethernet (RoCE) has emerged as the canonical method for deploying RDMA in Ethernetbased datacenters [24, 37]. The centerpiece of RoCE is a NIC that (i) provides mechanisms for accessing host memory without CPU involvement and (ii) supports very basic network transport functionality. Early experience revealed that RoCE NICs only achieve good end-to-end performance when run over a lossless network, so operators turned to Ethernet s Priority Flow Control (PFC) mechanism to achieve minimal packet loss. The combination of RoCE and PFC has enabled a wave of datacenter RDMA deployments. However, the current solution is not without problems. In particular, PFC adds management complexity and can lead to significant performance problems such as head-ofthe-line blocking, congestion spreading, and occasional deadlocks [24, 25, 34, 36, 37]. Rather than proceed along the current path and address the various problems with PFC, in this paper we take a step back and ask whether it was needed in the first place. To be clear, current RoCE NICs require a lossless fabric for good performance. However, the question we raise is: can the RoCE NIC design be altered so that losslessness is not needed? We answer this question in the affirmative and propose a new design called IRN (for Improved RoCE NIC) that makes two key changes to current RoCE NICs (i) more efficient loss recovery, and (ii) basic end-to-end flow control that bounds the number of in-flight packets by the bandwidth-delay product of the network. 1 We show, via extensive simulations, that IRN does not require PFC, yet performs better than current RoCE NICs with PFC ( 3). 
While our design requires extensions to the RDMA packet format (4), we show that it is relatively easy to implement in hardware by synthesizing IRN's packet processing logic to target an FPGA device supported by Mellanox Innova Flex 4 NICs, and by comparing IRN's memory requirements to those of current RoCE NICs (5). Before describing our design, we first review some background.

1 These changes are orthogonal to the congestion control mechanisms that have been implemented on top of RoCE NICs (e.g., DCQCN [37] and Timely [29]).

2 Background

2.1 Evolution of RDMA

RDMA has long been used by the HPC community in special-purpose Infiniband clusters that use credit-based flow control to make the network lossless [5]. Because packet drops are rare in such clusters, the RDMA Infiniband transport as implemented on the NIC was not designed to efficiently recover from packet losses. When the receiver receives an out-of-order packet, it simply discards it and sends a negative acknowledgement (NACK) to the sender. When the sender receives a NACK or experiences a timeout, it retransmits all packets that were sent after the last acknowledged packet (i.e., it performs a go-back-n retransmission).

To take advantage of the widespread use of Ethernet in datacenters, RoCE [6, 10] was introduced to enable the use of RDMA over Ethernet. 2 RoCE adopted the same Infiniband transport design (including go-back-n loss recovery) and the network was made lossless using PFC. Early versions of RoCE were shipped without any in-built congestion control mechanism (and instead used a static rate for each flow). When it became clear that PFC had significant problems, congestion control mechanisms [29, 37] were proposed for RoCE that reduced the sending rate on detecting congestion, thus reducing the queue build-up and, consequently, the negative impact of PFC. In particular, DCQCN [37], an ECN-based congestion control scheme, was added as a feature on Mellanox RoCE NICs [33], and Timely [29] was proposed as an application-layer delay-based scheme which could be used with RoCE.

RoCE is not the only approach to deploying RDMA over Ethernet. iWARP [32] was designed to support RDMA over a generic (non-lossless) network. It implements the entire TCP stack in hardware, along with multiple other layers that it needs to translate TCP's byte stream semantics to RDMA segments. It is significantly more complex than Infiniband-based RDMA, and early iWARP implementations had suboptimal performance, making RoCE the current canonical choice for deploying RDMA over Ethernet. Our approach to improving RoCE is to adopt the architectural approach of iWARP (having the NIC handle basic loss recovery) but the implementation approach of RoCE (adding only a minimal set of additional features to current RoCE NICs rather than putting an entire set of protocols in hardware).

2.2 Priority Flow Control

Priority Flow Control (PFC) [7] is an Ethernet flow control mechanism: a switch sends a pause (or X-OFF) message to the upstream entity (a switch or a NIC) when the queue exceeds a certain configured threshold. When the queue drains below the threshold, an X-ON message is sent to resume transmission. When configured correctly, PFC can result in lossless networks (as long as all network elements remain functioning).

Over the last couple of years, numerous papers have been published highlighting the drawbacks of PFC [24, 25, 34, 36, 37], which range from mild (e.g., unfairness, head-of-line blocking, collateral damage) to severe, such as the pause spreading highlighted in [24] 3 and the deadlock problems in [25, 34, 36]. While none of these problems are definitively fatal, there is wide agreement that PFC makes networks harder to understand and manage, and can lead to myriad performance problems.

2 We use the term RoCE for both RoCE [6] and its successor RoCEv2 [10], which enables running RDMA not just over Ethernet but also over IP-routed networks.

3 Is a lossless network needed for good performance?
In this section we show that the need for PFC is merely an artifact of the current RoCE NIC design (in particular, its transport logic). We begin by describing the transport logic for an Improved RoCE NIC (IRN) in 3.1. For simplicity, we present it as a general design independent of the specific RDMA operation (be it Write, Read, Send or Atomic). We go into the details of handling specific RDMA operations with IRN later in 4. We then describe our experimental setup in 3.2, present our basic results in 3.3, and analyze them in more detail in 3.4.

3.1 IRN Transport Logic

As discussed in 2, current RoCE NICs use a go-back-n loss recovery scheme. As we will see in 3.3, redundant retransmissions caused by go-back-n loss recovery, in the absence of PFC, negatively impact performance. Therefore, the first change we make with IRN is to make the loss recovery more efficient, based on selective retransmission (inspired by TCP's loss recovery), where the receiver does not discard out-of-order packets 4 and the sender selectively retransmits the lost packets. The second change we make with IRN is introducing the notion of a basic end-to-end packet-level flow control, where we cap the number of outstanding packets in flight for a flow (or a QP) by the bandwidth-delay product (BDP) of the network, as suggested in [17]. This is a static cap that we compute by dividing the BDP (in bytes) by the packet MTU set by the QP (typically 1KB in RoCE NICs). We call this BDP-FC (BDP Flow Control), which improves performance by reducing unnecessary queuing in the network. Furthermore, it provides a strict upper bound on the number of out-of-order packets that can be seen by the receiver, which, as we will see later in 5, eases the state management in the NICs.

3 Examples include a single malfunctioning NIC that can pause the entire network and a particular reaction between PFC and Ethernet flooding that results in a network deadlock.

4 We discuss how the out-of-order packets are handled by the receiver in detail in 4.

We elaborate on the performance impact of these changes in 3.4. The IRN loss recovery and BDP-based flow control mechanisms are largely orthogonal to the specific congestion control algorithm (such as DCQCN [37] and Timely [29]) that can be used on top of them, as with current RoCE NICs. Moreover, IRN, like the RoCE transport, operates directly on the RDMA segments using packet semantics. We describe both IRN's flow control mechanism (BDP-FC) and loss recovery mechanism in greater detail below.

BDP-FC: An IRN sender transmits a new packet only if the number of packets in flight (computed as the difference between the current packet's sequence number and the last acknowledged sequence number) is less than the BDP cap. It requires the IRN receiver to send per-packet ACKs.

Loss Recovery: Upon every out-of-order packet arrival, an IRN receiver sends a NACK, which carries both the cumulative acknowledgment number (indicating its expected sequence number) and the sequence number of the packet that triggered the NACK (as a simplified form of selective acknowledgement). An IRN sender enters loss recovery mode when a NACK is received or when a timeout occurs. It maintains a bitmap to track which packets have been cumulatively and selectively acknowledged, which gets updated by subsequent NACKs that are received. When in loss recovery mode, rather than sending new packets when the Tx is free, the sender selectively retransmits lost packets as indicated by the bitmap. The first packet that is retransmitted on entering loss recovery corresponds to the cumulative acknowledgement value. Any subsequent packet is considered lost only if another packet with a higher sequence number has been selectively acked. The sender continues to transmit new packets (if allowed by BDP-FC) if there are no more lost packets to retransmit. It exits loss recovery when a cumulative acknowledgement greater than the recovery sequence is received, where the recovery sequence corresponds to the last regular packet that was sent before the retransmission of a lost packet.

NACKs with selective acknowledgement allow efficient loss recovery only when there are multiple packets in flight. For other cases (for example, single-packet messages), loss recovery gets triggered via timeouts. A high RTO timer value can increase the tail latency of these short single-packet messages. However, keeping the RTO timer value too small can result in too many spurious retransmissions, affecting the overall results. An IRN sender, therefore, uses a low timer value of RTO low only when there is a small number N of packets in flight, and a higher value of RTO high otherwise. 5 We discuss how the timeout feature in current RoCE NICs can be easily extended to support this in 5.2.

As mentioned before, IRN's loss recovery has been inspired by TCP's loss recovery. However, rather than incorporating the entire TCP stack as is done by iWARP NICs, IRN (1) decouples loss recovery from congestion control and does not incorporate any notion of TCP window control involving slow start, AIMD or advanced fast recovery, and (2) operates directly on RDMA segments instead of using a byte stream abstraction as in TCP, which allows IRN to simplify the selective acknowledgement scheme and the tracking of packet losses (as we will discuss in 5.2).

5 We also experimented with dynamically computed timeout values (as with TCP), which not only complicated the design, but did not help with this case, since these effects were then dominated by the initial timeout value.
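To make the transport logic above concrete, the following is a minimal sender-side sketch of BDP-FC and the SACK-bitmap loss recovery, written directly from the description in this section. It is an illustration, not the NIC's actual data path: the struct layout, the field names (snd_una, rtx_nxt, and so on), the plain in-memory bitmap and the demo scenario are our own assumptions.

```c
/* Minimal sketch of IRN's sender-side transport logic (illustrative only:
 * names, layout and the demo scenario in main() are our own assumptions). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BDP_CAP 110u   /* BDP in MTU-sized packets (default scenario of 3.2) */

struct irn_qp {
    uint32_t snd_nxt;         /* next new sequence number to send             */
    uint32_t snd_una;         /* cumulative ack: first unacknowledged packet  */
    uint32_t rtx_nxt;         /* next candidate for selective retransmission  */
    uint32_t recovery_seq;    /* last regular packet sent before a retransmit */
    uint32_t highest_sacked;  /* highest selectively acked sequence number    */
    bool     in_recovery;
    bool     sacked[BDP_CAP]; /* SACK bitmap, indexed by (seq - snd_una)      */
};

/* BDP-FC: a new packet may be sent only if fewer than BDP_CAP are in flight. */
static bool bdp_fc_allows_new(const struct irn_qp *qp)
{
    return qp->snd_nxt - qp->snd_una < BDP_CAP;
}

/* NACK carries the cumulative ack and the sequence that triggered the NACK. */
static void on_nack(struct irn_qp *qp, uint32_t cum_ack, uint32_t sack_seq)
{
    while (qp->snd_una < cum_ack) {            /* slide the bitmap forward    */
        memmove(qp->sacked, qp->sacked + 1, sizeof(qp->sacked) - sizeof(bool));
        qp->sacked[BDP_CAP - 1] = false;
        qp->snd_una++;
    }
    if (sack_seq >= qp->snd_una && sack_seq - qp->snd_una < BDP_CAP)
        qp->sacked[sack_seq - qp->snd_una] = true;
    if (sack_seq > qp->highest_sacked)
        qp->highest_sacked = sack_seq;

    if (!qp->in_recovery) {                    /* enter loss recovery          */
        qp->in_recovery  = true;
        qp->recovery_seq = qp->snd_nxt - 1;    /* last regular packet sent     */
        qp->rtx_nxt      = cum_ack;            /* first retransmission         */
    } else if (cum_ack > qp->recovery_seq) {   /* recovery sequence passed     */
        qp->in_recovery  = false;
    }
}

/* Called when the Tx is free: retransmit a lost packet if in recovery,
 * otherwise send a new packet if BDP-FC allows it. */
static bool next_to_transmit(struct irn_qp *qp, uint32_t *seq)
{
    if (qp->in_recovery) {
        if (qp->rtx_nxt < qp->snd_una)
            qp->rtx_nxt = qp->snd_una;
        /* A packet is lost only if a higher sequence has been SACKed. */
        while (qp->rtx_nxt < qp->highest_sacked &&
               qp->sacked[qp->rtx_nxt - qp->snd_una])
            qp->rtx_nxt++;
        if (qp->rtx_nxt < qp->highest_sacked) {
            *seq = qp->rtx_nxt++;              /* selective retransmission     */
            return true;
        }
    }
    if (bdp_fc_allows_new(qp)) {
        *seq = qp->snd_nxt++;                  /* new packet                   */
        return true;
    }
    return false;
}

int main(void)
{
    struct irn_qp qp = { .snd_nxt = 10 };      /* packets 0..9 are in flight   */
    on_nack(&qp, 3, 5);        /* 0-2 cumulatively acked, 5 SACKed: 3,4 lost   */

    uint32_t seq;
    for (int i = 0; i < 4 && next_to_transmit(&qp, &seq); i++)
        printf("transmit %u%s\n", (unsigned)seq,
               seq < 10 ? " (selective retransmission)" : " (new packet)");
    return 0;
}
```

Congestion control is deliberately absent from the sketch; as noted above, BDP-FC and loss recovery are orthogonal to whichever scheme (DCQCN, Timely, or none) runs on top of them.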
3.2 Experimental Settings

Simulator: Our simulator, obtained from a commercial NIC vendor, extends INET/OMNET++ [1, 2] to model the Mellanox ConnectX-4 RoCE NIC [11]. RDMA Queue Pairs (QPs) are modelled as UDP applications with either RoCE or IRN transport layer logic that generate flows (whose sizes are determined by the input workload as described later). We define a flow as a unit of data transfer comprising one or more messages between the same source-destination pair, as in [29, 37]. When the sender QP is ready to transmit data packets, it periodically polls the MAC layer until the link is available for transmission. The simulator implements the version of DCQCN implemented by the Mellanox ConnectX-4 RoCE NIC [33], and we add support for a NIC-based Timely implementation. When using DCQCN congestion control with IRN we include the ECN echo in the IRN ACK/NACK packets. For DCQCN with RoCE, where ACK/NACKs are not sent for every packet, we use a separate CNP packet as in [33, 37]. Finally, in our simulation, all switches are input-queued with virtual output ports that are scheduled using round-robin. The switches can be configured to generate PFC frames by setting appropriate buffer thresholds.

Default Case Scenario: For our default case, we simulate a 54-server three-tiered fat-tree topology, connected by a fabric with full bisection bandwidth constructed from 45 6-port switches organized into 6 pods [16]. We consider 40Gbps links with a propagation delay of 2µs, resulting in a bandwidth-delay product (BDP) of 120KB along the longest (6-hop) path. This corresponds to about 111 MTU-sized packets (assuming a typical RDMA MTU of 1KB and adding headers). Each end host generates new flows with Poisson interarrival times [17, 30]. Each flow's destination is picked randomly and its size is drawn from a realistic heavy-tailed distribution derived from [19]. Most flows are small (50% of the flows are single-packet messages with sizes ranging between 32 bytes and 1KB, representing small RPCs such as those generated by RDMA-based key-value stores [22, 26]), and most of the bytes are in large flows (15% of the flows are between 200KB and 3MB, representing background RDMA traffic such as storage).

Figure 1: The figures compare the PFC-enabled vs PFC-disabled performance for RoCE transport with and without explicit congestion control schemes along three metrics: average slowdown, average FCT, and tail FCT.

Figure 2: The figures compare the PFC-enabled vs PFC-disabled performance of IRN when used (a) without any explicit congestion control and (b) with explicit congestion control schemes, using three metrics (average slowdown, average FCT, and tail FCT). Note that the y-axis range is different from that in Figure 1.

The network load is set at 70% utilization for our default case. We use ECMP for load-balancing [24]. We vary different aspects of our default scenario (including topology size, workload pattern and link utilization) in 3.4.

Parameters: For IRN, we set RTO low to 100µs (with N = 3). RTO high is set to an estimate of the maximum round trip time with one congested link. We compute this as the sum of the propagation delay on the longest path and the maximum queuing delay a packet would see if the switch buffer on a congested link is completely full. This is approximately 320µs for our default case. When using RoCE without PFC, we use a fixed timeout of value RTO high. We disable timeouts when PFC is enabled to prevent spurious retransmissions. For our default case, we use buffers sized at twice the BDP of the network (which is 240KB in our default case) for each input port [17, 18]. The PFC threshold at the switches is set to the buffer size minus a headroom equal to the upstream link's bandwidth-delay product (needed to absorb all packets in flight along the link). This is 220KB for our default case. We vary these parameters in 3.4 to show that our results are not very sensitive to these specific choices. When using RoCE or IRN with Timely or DCQCN, we use the same congestion control parameters as specified in [29] and [37] respectively. For fair comparison with PFC-based proposals [36, 37], flows start at line rate in all cases (with and without congestion control).

Methodology: We run the simulation for a fixed duration (100ms by default). We then consider all flows that started within a specified time interval in this duration (12.5ms to 50ms for our default case, resulting in about 25 thousand flows) and compute the following three metrics: (i) average slowdown, where the slowdown for a flow is its completion time divided by the time it would have taken to traverse its path at line rate in an empty network, (ii) average flow completion time (FCT), and (iii) the tail flow completion time. While the average and tail FCTs are dominated by the performance of throughput-sensitive flows, the average slowdown is dominated by the performance of latency-sensitive short flows.

3.3 Basic Results

We now confront the central question of this paper: Does RDMA require a lossless network? If the answer is yes, then we must address the many difficulties of PFC. If the answer is no, then we can leave PFC behind and utilize common network infrastructure without further complications.
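As a brief aside, the default-scenario parameters quoted in 3.2 can be sanity-checked with simple arithmetic. The constants in the sketch below are taken from the text; the formulas are our own back-of-the-envelope reconstruction (the paper's cap of about 111 packets additionally accounts for per-packet headers, and RTO high depends on simulator-specific queuing details we do not attempt to reproduce).

```c
/* Back-of-the-envelope check of the default-scenario parameters in 3.2.
 * Constants come from the text; the arithmetic is our own reconstruction. */
#include <stdio.h>

int main(void)
{
    const double link_bps  = 40e9;   /* 40Gbps links                          */
    const double prop_s    = 2e-6;   /* per-hop propagation delay (2us)       */
    const int    hops      = 6;      /* longest path in the fat-tree          */
    const double mtu_bytes = 1024;   /* typical RoCE MTU of 1KB               */

    /* Two-way propagation delay on the longest path and the resulting BDP.   */
    double rtt_s     = 2.0 * hops * prop_s;            /* 24us                */
    double bdp_bytes = link_bps * rtt_s / 8.0;         /* 120KB               */
    double bdp_cap   = bdp_bytes / mtu_bytes;          /* ~117 bare-MTU packets;
                                                          ~111 once headers are added */

    /* Buffers are sized at twice the BDP; the PFC threshold leaves a headroom
     * of one upstream-link BDP (one-hop round trip) to absorb in-flight data. */
    double buffer_bytes   = 2.0 * bdp_bytes;                   /* 240KB       */
    double headroom_bytes = link_bps * (2.0 * prop_s) / 8.0;   /* 20KB        */
    double pfc_threshold  = buffer_bytes - headroom_bytes;     /* 220KB       */

    printf("RTT            : %.0f us\n", rtt_s * 1e6);
    printf("BDP            : %.0f KB (~%.0f MTU packets)\n", bdp_bytes / 1e3, bdp_cap);
    printf("Buffer per port: %.0f KB\n", buffer_bytes / 1e3);
    printf("PFC threshold  : %.0f KB\n", pfc_threshold / 1e3);
    return 0;
}
```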
We begin with some basic results for our default scenario.

Performance with Current RoCE NICs

Figure 1 shows the performance of current RoCE NICs with and without PFC (and with various forms of congestion control). The first set of bars in each graph uses the current RoCE design as described in 2 with no explicit congestion control. The use of PFC helps considerably. Removing PFC increases the average slowdown, the average FCT and the tail FCT by more than 2.5×, 3×, and 1.5× respectively. The last two sets of bars in each graph use RoCE with Timely and RoCE with DCQCN. PFC continues to be important, with performance improvements ranging from 1.35× to 3.5× across the three metrics. Thus, it is clear that PFC provides significant benefits for current RoCE NICs. This is largely because of the go-back-n loss recovery used by current RoCE NICs, which penalizes performance due to (1) increased congestion created by redundant retransmissions and (2) the time (and bandwidth) wasted by flows in sending redundant packets when recovering from losses.
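A toy calculation makes the cost of go-back-n concrete (the window size and loss pattern below are made-up illustrative values, not measurements from the paper): with a BDP's worth of packets in flight and a few drops, go-back-n puts roughly a full window back on the wire, whereas selective retransmission resends only what was actually lost.

```c
/* Illustrative count of retransmitted packets after a few drops, under
 * go-back-n vs selective retransmission (numbers are made up for clarity). */
#include <stdio.h>

#define WINDOW 110   /* packets in flight, e.g. one BDP cap */

int main(void)
{
    /* Suppose the packets at these 0-based positions in the window are lost. */
    const int lost[] = { 5, 40, 41, 90 };
    const int n_lost = (int)(sizeof(lost) / sizeof(lost[0]));

    /* Go-back-n: everything from the first loss onward is resent, even though
     * all but n_lost of those packets already reached the receiver.          */
    int gbn_retx = WINDOW - lost[0];

    /* Selective retransmission: only the lost packets are resent.            */
    int sel_retx = n_lost;

    printf("go-back-n : %d packets resent (%d of them redundant)\n",
           gbn_retx, gbn_retx - n_lost);
    printf("selective : %d packets resent (0 redundant)\n", sel_retx);
    return 0;
}
```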

Figure 3: The figures compare the PFC-enabled RoCE performance vs the PFC-disabled IRN performance with and without different congestion control schemes, using three metrics (average slowdown, average FCT, and tail FCT).

Performance with IRN

We now evaluate whether or not the need for PFC is a fundamental requirement for good performance by comparing IRN's performance with and without PFC. In Figure 2(a) we show the results of this comparison when no explicit congestion control is used. Remarkably, we find that not only is PFC not required, but it significantly degrades IRN's performance (increasing the value of each metric). This is because of the head-of-the-line blocking and congestion spreading issues PFC is notorious for: pauses triggered by congestion at one link cause queue build-up and pauses at other upstream entities, creating a cascading effect. Note that packet drops also negatively impact performance, since it takes about one round trip time to detect a packet loss and another round trip time to recover from it. However, the negative impact of a packet drop (given efficient loss recovery) is restricted to the flow that faces congestion and does not spread to other flows, as it does in the case of PFC. While these PFC issues have been observed before [24, 29, 37], we believe our work is the first to show that, given efficient loss recovery, PFC can have a greater negative impact on performance than packet drops.

The previous comparison showed that PFC is not necessary when using IRN without any explicit congestion control mechanism. However, as mentioned before, RoCE today is typically deployed in conjunction with some explicit congestion control mechanism such as Timely and DCQCN. Figure 2(b) shows the performance with and without PFC when we use these congestion control mechanisms on top of our IRN transport base. 6 We find that IRN's performance is largely independent of whether PFC is enabled or disabled: the largest negative impact of disabling PFC was less than 1%, and the largest improvement due to disabling PFC was about 3.5%. This is because, with explicit congestion control, both the PFC rate reduces (reducing the impact of congestion spreading) and the number of packet drops reduces, thus reducing the difference between PFC-enabled and PFC-disabled performance. We report the corresponding drop rates and PFC rates in A.1 of the attached appendix.

6 Note that while the average and tail FCT were slightly smaller when IRN was used without any congestion control algorithm, the average slowdown is significantly lower when IRN is used with Timely and DCQCN. This is because both Timely and DCQCN aim to achieve smaller queues (and therefore, smaller packet latencies), while compromising a little on the throughput.

Figure 4: The figure compares the average FCT for IRN with go-back-n loss recovery, IRN with the default selective retransmit based loss recovery, and IRN with PFC, across different congestion control algorithms. The y-axis is capped at 0.003s to better highlight the trends.

Current RoCE NICs vs IRN

The previous results compared IRN's performance with and without PFC, showing that it does not require losslessness. Figure 3 compares the performance of IRN without PFC vs current RoCE NICs with PFC.
We find that IRN without PFC performs better than RoCE + PFC across all metrics and congestion control schemes. This is because IRN's BDP-FC mechanism reduces unnecessary queuing, and disabling PFC eliminates congestion spreading issues.

Key Takeaways: The following are the three takeaways from these results: (1) current RoCE NICs require PFC to achieve good performance, (2) IRN does not require PFC, and (3) IRN performs better than current RoCE NICs. We analyze these results in more detail next.

3.4 Detailed Analysis

We now explore these basic results in more detail, by performing a factor analysis of IRN and evaluating the robustness of our trends across different experimental scenarios.

IRN Factor Analysis: As discussed in 3.1, IRN makes two key changes to the RoCE NIC: (1) efficient loss recovery and (2) BDP-FC. We now analyze the significance of these two changes.

Need for Efficient Loss Recovery: To study the significance of efficient loss recovery, we enabled go-back-n loss recovery for IRN and compared its performance with our default selective retransmit based IRN and with IRN + PFC (the latter two data points being the same as those evaluated in 3.3). Figure 4 shows the average FCT for the three cases. We find that IRN with go-back-n loss recovery performs significantly worse than both IRN with the default selective retransmit loss recovery and IRN with PFC.

We saw similar trends for average slowdown and tail FCT. This is because of the bandwidth wasted by go-back-n due to redundant retransmissions.

Before converging on the current IRN loss recovery mechanism (based on simplified SACKs), we also experimented with other, simpler loss recovery mechanisms. In particular, we explored the following two questions: (1) Can go-back-n be made more efficient? We found that explicitly backing off on losses (by reducing the rate by half whenever a packet loss is detected) reduced the negative impact of go-back-n for Timely (though not for DCQCN). Nonetheless, selective retransmit based loss recovery continued to perform significantly better across different scenarios (with the difference in average FCT for Timely ranging from 20%-50%). (2) Do we need SACKs? We also tried a selective retransmit scheme without SACKs (where the sender does not maintain a bitmap to track selective acknowledgements). This performed better than go-back-n. However, it fared poorly when there were multiple losses in a window, requiring multiple round trips to recover from them. The corresponding degradation in average FCT ranged from <1% to as high as 75% across different scenarios when compared to IRN's SACK-based loss recovery.

Figure 5: The figure compares the average FCT for the PFC-enabled and PFC-disabled performance of IRN without BDP-FC, across different congestion control algorithms.

Significance of BDP-FC: We next disabled BDP-FC for IRN (for both the PFC-enabled and PFC-disabled cases) 7 and compared its performance with BDP-FC enabled IRN (as evaluated in 3.3). Figure 5 reports the resulting average FCTs. We find that BDP-FC significantly improves performance by reducing unnecessary queuing (along with packet drops or PFC frames). Further, for the PFC-disabled case, BDP-FC prevents a flow that is recovering from a packet loss from sending additional new packets and increasing congestion, until the acknowledgement number has been advanced after loss recovery. Notice that, for our default case, IRN's performance without PFC continues to be better than with PFC when BDP-FC is disabled. This is because the congestion spreading issue is further intensified without BDP-FC. However, we will later discuss a specific scenario (incast) where BDP-FC plays a crucial role in getting good performance without PFC.

7 IRN with PFC and no BDP-FC is essentially the same as RoCE + PFC with per-packet ACKs.

Robustness across varying scenarios

We now discuss the robustness of IRN as the experimental scenario is varied from our default case. We experimented with (1) link utilization levels varied between 30%-90%, (2) link bandwidths varied from the default of 40Gbps to 10Gbps and 100Gbps, (3) larger fat-tree topologies with 128 and 250 servers, (4) a different workload with flow sizes uniformly distributed between 500KB and 5MB, representing background and storage traffic for RDMA, (5) the per-port buffer size varied between 60KB-480KB, and (6) varying other IRN parameters (increasing the RTO high value by up to 4 times the default of 320µs, and increasing the N value for using RTO low to 10 and 15). We present a brief summary of our results here and provide the detailed results for each of these scenarios in A.2 of the attached appendix.

Overall Results: Our results across these different scenarios are summarized below.
(a) When IRN was used without any congestion control, disabling PFC always improved performance, with the maximum improvement in performance across different scenarios being as high as 71%. (b) When IRN was used with Timely and DCQCN, disabling PFC often resulted in better performance (with the maximum improvement in performance being 28% for Timely and 17% for DCQCN). Any degradation in performance due to disabling PFC with IRN stayed within 1.6% for Timely and 5.1% for DCQCN. (c) IRN without PFC always performed better than RoCE with PFC, with the performance improvement ranging from 6% to 83% across different cases.

Some observed trends: We now list some interesting trends we observed across these scenarios. (a) The benefits of disabling PFC with IRN generally increase with increasing link utilization, as the negative impact of congestion spreading with PFC increases. (b) The benefits of disabling PFC with IRN decrease with increasing bandwidths, since the relative cost of a round trip time spent on detecting and reacting to packet drops also increases. (c) The benefits of disabling PFC with IRN increase with decreasing buffer sizes because of the higher number of PFC frames and the greater impact of congestion spreading. In general, as the buffer size is increased, the difference between PFC-enabled and PFC-disabled performance with IRN reduces (due to fewer PFC frames and packet drops), while the benefit of using IRN over RoCE + PFC increases (because of the greater relative reduction in queuing delay with IRN due to BDP-FC). 8

We also implemented conventional window-based congestion control schemes such as TCP's AIMD and DCTCP with IRN and observed similar trends.

8 We expect to see similar behaviour in shared buffer switches.

Figure 6: The figures compare the tail latency for single-packet messages for IRN (with and without PFC) and current RoCE NICs with PFC, across different congestion control algorithms: (a) no CC, (b) Timely, (c) DCQCN.

Figure 7: The figure shows the ratio of the request completion time of incast with IRN without PFC over IRN with PFC, for varying degrees of fan-in, across congestion control algorithms.

In fact, when IRN is used with TCP's AIMD logic, the benefits of disabling PFC were even stronger. This is because TCP exploits packet drops as a congestion signal (which is lost when PFC is enabled). We further observed that while the performance penalty for doing go-back-n was smaller for window-based schemes, IRN's selective retransmit based loss recovery scheme continued to perform significantly better across different scenarios. More details on this can be found in A.3 of the attached appendix.

Tail latency for small messages

We now look at the tail latency (or tail FCT) of the single-packet messages (with sizes 32 bytes to 1KB) from our default scenario, which is another metric that often matters in datacenters [29]. Figure 6 shows the tail CDF of latency (from 90%ile to 99.9%ile) for these messages, across different congestion control algorithms. Our key trends from 3.3 hold even for this metric. This is because IRN without PFC is able to recover from single-packet message losses quickly due to the low RTO low timeout value. With IRN + PFC, these messages end up waiting in the queues for a similar (or greater) duration due to pauses and congestion spreading. For all cases, IRN performs significantly better than RoCE + PFC (the latter seeing a larger negative impact due to congestion spreading).

Incast

We now consider different incast scenarios, both with and without cross-traffic. The incast workload without any cross traffic can be identified as the best case for PFC, since only valid congestion-causing flows are paused, without unnecessary head-of-the-line blocking.

Incast without cross-traffic: We simulate the incast workload on our default topology by striping 150MB of data across M randomly chosen sender nodes that send it to a fixed destination node [17]. We vary M from 10 to 50. We consider the request completion time (RCT) as the metric for incast performance, which is the time when the last flow completes. For each M, we repeat the experiment a hundred times and report the average RCT from these runs. Figure 7 shows the results, comparing IRN with and without PFC. Here we see the reverse trend from our previous results, where IRN without any congestion control was most adversely affected by disabling PFC. Nonetheless, any increase in the RCT due to disabling PFC remained less than 2.5%. The corresponding results comparing IRN with RoCE + PFC looked very similar to this, with the ratio of the RCT of IRN without PFC over RoCE + PFC staying between 0.95 and 1.024 across different congestion control schemes. In other words, the degradation due to using IRN without PFC (if any) stayed within 2.4%. We present these results in A.3 of the appendix.

Robustness: We observed that disabling BDP-FC with IRN for this experiment resulted in more than 10× higher RCTs.
Nonetheless, our results were fairly robust to overestimation of BDP: using a window of (3 BDP) for BDP-FC resulted in at most 6% increase in RCT without PFC for all (except one) 9 data points. We saw similar trends when the buffer size was reduced to 60KB. Reducing the link bandwidth to 10Gbps resulted in at most 3.5% increase in RCT due to disabling PFC. Increasing the bandwidth to 100Gbps resulted in at most 9% increase in RCT without congestion control and at most 4.6% with congestion control due to disabling PFC. We also simulated another incast scenario with 16 senders sending 1MB of data per connection to a single destination (all attached to the same ToR), and varied the number of connections per sender from 4 to 10 as in [29]. This resulted in at most 7% increase in RCT without PFC. Incast with cross traffic: The previous evaluation was the best-case scenario for PFC, where there was no cross traffic, so there was no negative performance impact of congestion spreading. In practice we expect incast to occur with other cross traffic in the network [24, 29]. We 9 The exception was when IRN was used without any congestion control for a 10 to 1 incast, where the increase in RCT without PFC was 15%. 7

8 started an incast as described above with M = 30, along with our default case workload running at 50% link utilization levels. Disabling PFC with IRN lowered the incast RCT by 40% without congestion and by 7% with Timely, and increased it by only 1.15% with DCQCN. The incast RCT for IRN without PFC was lower than RoCE + PFC by 4%-30% across the three congestion control schemes. For the background workload, PFC-disabled IRN s performance was better than both IRN + PFC (by 1%-75%) and RoCE + PFC (by 32%-87%) across the three congestion control schemes and the three metrics (i.e., the average slowdown, the average and the tail ). Summary: Our key results i.e., (1) IRN does not require PFC, and (2) IRN without PFC performs better than current RoCE NICs with PFC, hold across varying realistic scenarios, congestion control schemes and performance metrics. 4 IRN Implementation Considerations We now discuss how one can implement IRN s transport logic in an RDMA NIC, while maintaining the correctness of RDMA semantics as defined by the Infiniband RDMA specification [5]. Our implementation relies on extensions to RDMA s packet format, e.g., introducing new fields and packet types. These extensions are encapsulated within IP and UDP headers (as in RoCEv2) so they only effect the endhost behavior and not the network behavior (i.e. no changes are required at the switches). We provide a glossary for various RDMA-specific terms in B of the appendix. 4.1 Supporting RDMA Reads and Atomics IRN relies on per-packet ACKs for BDP-FC and loss recovery. Current RoCE NICs already support per-packet ACKs for RDMA Writes and Sends. However, when using RDMA Reads, the requester (which is the data sink) does not explicitly acknowledge response packets. IRN, therefore, introduces packets for read (N)ACKs that are sent by a requester for each Read response packet. RoCE currently has eight unused opcode values available for the reliable connected (RC) QPs, and we use one of these for read (N)ACKs. IRN also requires the responder to implement timeouts for read responses. 10. RDMA Atomic operations are treated similar to single-packet RDMA Read message. Our simulations from 3.1 did not use ACKs for the RoCE + PFC baseline 11, modelling the extreme case of all Reads. Therefore our results from 3 take into account the additional overhead of per-packet ACKs in IRN. 10 New timer-driven actions (such as, for implementing DCQCN) have been added to the NIC hardware implementation in the past [33] 11 We enabled per-packet ACKs for Timely which were required to compute RTTs. 4.2 Supporting Out-of-order Packet Delivery One of the key challenge for implementing IRN is supporting out-of-order (OOO) packet delivery at the receiver current RoCE NICs simply discard OOO packets. A naive approach for handling OOO packet would be to store all of them in NIC memory. The total number of OOO packets with IRN is bounded by the BDP cap (which is about 110 MTU-sized packets for our default scenario as described in 3.2) 12 Therefore to support a thousand flows, a NIC would need to buffer 110MB of packets, which exceeds the memory capacity on most commodity RDMA NICs. We therefore explore an alternate implementation strategy, where the NIC DMAs OOO packets directly to the final address in the application memory and keeps track of them using bitmaps (which are sized at BDP cap). This reduces NIC memory requirements from 1KB per OOO packet to only a couple of bits. However, it introduces some additional challenges. 
While we discuss them in greater detail below, at a high-level they are as follows: (1) For many RDMA operations, critical information is carried in the first packet (which is required to process other packets in the message) and/or in the last packet (which is required to complete message processing). Enabling OOO delivery therefore requires that (a) some of the information in the first packet be carried by all packets, and (b) the relevant information in the last packet be stored at the endpoint (either on NIC or main memory) for future use. (2) Some operations require packets to be matched with the corresponding WQE, which is done implicitly for inorder packet arrivals. But this implicit matching breaks with OOO packet arrival, creating a need for an explicit WQE sequence number to identify the corresponding WQE for each packet. Note that partial support for out-of-order packet delivery was introduced in the Mellanox ConnectX-5 NICs to enable adaptive routing [12]. However, it is restricted to RDMA Write and Read operations. We improve and extend this design to support all RDMA operation types in IRN Placing out-of-order requests packets: We now describe the challenges associated with placing out-of-order request packets and how IRN resolves them. RDMA Write: In current RoCE NICs only the first packet of an RDMA Write message carries the RETH header, which contains the remote memory location. If this is lost, the receiver would not know where to place subsequent data. IRN, therefore, requires putting the 12 For QPs that only send single packet messages less than one MTU in size, the number of outstanding packets is limited to the maximum number of outstanding requests, which is typically much smaller than the BDP cap [26, 27]. 8

9 RETH header extension in every packet. Send/Receive: InfiniBand specification requires that Receive WQEs (which contain the destination address for Send operations) be consumed in the order they are posted. With IRN every receive WQE and Send/Write with Immediate request WQE maintains a recv WQE SN, which is carried by all Send packets and by the last Write with Immediate packet. This can be used by an OOO Send packet to identify the correct receive WQE. IRN also requires the Send packets to carry the relative offset in the packet sequence number, which is used to identify the precise address when placing data. RDMA Read/Atomic: RDMA application semantics require the responder to not begin processing a Read- /Atomic request R, until all packets expected to arrive before R have been received. Therefore, an OOO Read- /Atomic request packet needs to be placed in a Read WQE buffer (which is already maintained by current RoCE NICs). With IRN, every Read/Atomic WQE maintains a read WQE SN, that is carried by all Read/Atomic request packets and allows identification of the correct index in the Read WQE buffer Signalling Completion The responder, in current RoCE NICs, maintains a message sequence number (MSN) which gets incremented when the last packet of a Write/Send message is received or when a Read/Atomic request is received. This MSN value is sent back to the requester in the ACK packets and is used to expire the corresponding request WQEs. The responder expires its receive WQE when the last packet of a Send or a Write With Immediate message is received and generates a CQE that contains certain meta-data about the transfer carried in the last packet. IRN s responder deals with OOO arrival for the last packet by maintaining a 2-bitmap, which in addition to tracking whether or not a packet p has arrived, also tracks whether (1) it will trigger an MSN update and (2) it will trigger a receive WQE expiration. For the latter case, the recv WQE SN carried by p (as discussed above) can identify the receive WQE with which the meta-data in p needs to be associated, thus enabling a premature CQE creation. The premature CQE can be stored in the main memory, until it gets delivered to the application after all packets upto p have been received. A more detailed description for this has been provided in C.1 of the appendix Application-level Assumptions Certain applications (for example FaRM [22]) rely on polling the last packet of a Write message to detect completion, which is incompatible with OOO data placement. This polling based approach violates the RDMA specification (Sec o9-20 [5]) and is more expensive than officially supported methods (FaRM [22] mentions moving on to using Write with Immediate in the future for better scalability). IRN s design provides all of the Write completion guarantees as per the RDMA specification. C.2 of the appendix discusses this in more details. OOO data placement can also result in a situation where data written to a particular memory location is overwritten by a restransmitted packet from an older message. Typically, applications using distributed memory frameworks assume relaxed memory ordering and use application layer fences whenever strong memory consistency is required [14, 35]. Therefore, both iwarp and Mellanox ConnectX-5, in supporting OOO data placement, expect the application to deal with the potential memory overwriting issue and do not handle it in the NIC or the driver. IRN can adopt the same strategy. 
Another alternative is to deal with this issue in the driver, by enabling the fence indicator for a newly posted request that could potentially overwrite an older one. 4.3 Other Considerations Currently, the same packet sequence number (PSN) counter is used to track request and response packets. This can result in sudden jumps in PSN, interfering with loss tracking and BDP-FC. IRN, therefore, splits the PSN counter into two different ones (1) spsn to track the request packets, and (2) rpsn to track the response packets. This decoupling remains transparent to the application and does not require any packet format changes. A detailed example for this is provided in C.3 of the appendix. IRN can also support shared receive queues and send with invalidate operations and is compatible with use of end-to-end credit (discussed in more details in C.4-C.7 of the attached appendix). 5 Implementation Feasibility We evaluate the feasibility of implementing IRN by breaking up the implementation overheads into three categories: (1) in 5.1, we do a comparative analysis of IRN s memory requirements; (2) in 5.2, we evaluate the overhead for implementing IRN s packet processing logic by synthesizing it on an FPGA; and (3) in 5.3, we evaluate via simulations how IRN s implementation-specific choices impact the end-to-end performance. 5.1 NIC State overhead due to IRN IRN s implementation requires some additional state on the NIC, that we summarize below. Additional Per-QP Context: State variables: IRN needs a total of 52 bits of additional state for its transport logic: 24 bits each for retransmit sequence and recovery sequence, and 4 bits for various flags. Other per-flow state variables needed for implementing IRN s transport logic (such as the expected sequence number, the next sequence to send and so on) are already maintained by current RoCE NICs. This accounts for 9

10 104 bits per QP (52 bits each at the requester and the responder). Maintaining a timer at the responder for Read timeouts and a variable to track incomplete Read request adds another 56 bits to the responder leading to a total of 160 bits (or 20 bytes) of additional per-qp state with IRN. For context, RoCE NICs currently maintain a few thousands of bits of per QP for various state variables [3]. Bitmaps: IRN requires five BDP sized bitmaps: two at the responder for the 2-bitmap to track request packets, one at the requester to track the Read responses, one each at the requester and responder for tracking selective acks. Assuming each bitmap to be 128 bits (i.e., sized at the BDP cap for a network with bandwidth 40Gbps and a twoway propagation delay of up to 24µs, typical in today s datacenter topologies [29]) 13, IRN would require a total of 640 bits (80 bytes) per QP for bitmaps. This is much less than the total size of bitmaps maintained by a QP for the OOO support in Mellanox ConnectX-5 NICs [3]. Others: Other per-qp meta-data that is needed by an IRN driver when a WQE is posted (e.g the counters used for assigning WQE SNs) or when it is expired (e.g. premature CQEs) can be stored directly in the main memory and do not add to the NIC memory overhead. Additional Per-WQE Context: As described in 4, IRN requires certain types of WQEs to maintain WQE SNs, which adds a total of 3 bytes to the per-wqe context (currently sized at 64 bytes [3]). Additional Shared State: IRN also maintains some additional variables (or parameters) that are shared across QPs. This includes the BDP cap value, the RTO low value, and N for RTO low, which adds up to a total of 10 bytes. Mellanox RoCE NICs support several MBs of cache to store various metadata including per-qp and per-wqe contexts [3]. This implies that the additional state added by IRN only consumes 3-10% of the NIC cache size for a couple of thousands of QPs and tens of thousands of WQEs (even when the size of the bitmap is increased to support 100Gbps links). 5.2 IRN s packet processing overhead We now evaluate the feasibility of implementing IRN s per-packet processing logic, which requires various bitmap manipulations and can be identified as the key functionality that differs from what is currently implemented in RoCE NICs. The basic provisions required for implementing other IRN changes (such as adding header extensions to packets, or premature CQE generation) are already implemented in RoCE NICs and can be easily extended for IRN. We use Xilinx Vivado Design Suite [4] to do a high-level synthesis of the four key packet processing modules (as described below), targeting Kintex Ultrascale 13 We pad the bitmap to a multiple of 32, for ease of implementation as described in 5.2 XCKU060 FPGA, which is supported as a bump-on-thewire on Mellanox Innova Flex 4 10/40Gbps NICs [13] Synthesis Process To focus on the additional packet processing complexity due to IRN, our implementation for the four modules is stripped-down. More specifically, each module receives the relevant packet metadata and the QP context as streamed inputs, relying on a RoCE NIC s existing implementation to parse the packet headers and retrieve the QP context from the NIC cache (or the system memory, in case of a cache miss). The updated QP context is passed as streamed output from the module, along with other relevant module-specific outputs as described below. 
(1) receivedata: Triggered on a packet arrival, it outputs the relevant information required to generate an ACK/- NACK packet and the number of premature CQEs that can be expired, along with the updated QP context (e.g. bitmaps, expected sequence number, MSN). (2) txfree: Triggered when the link s Tx is free for the QP to transmit, it outputs the sequence number of the packet to be (re-)transmitted and the updated QP context (e.g. next sequence to transmit). During loss-recovery, it also performs a look ahead by searching the SACK bitmap for the next packet sequence to be retransmitted. (3) receiveack: Triggered when an ACK/NACK packet arrives, it outputs the updated QP context (e.g. SACK bitmap, last acknowledged sequence). (4) timeout: If triggered when the timer expires using RTO low value (indicated by a flag in the QP context), it checks if the condition for using RTO low holds. If not, it does not take any action and sets an output flag to extend the timeout to RTO high. In other cases, it executes the timeout action and returns the updated QP context. Our implementation relies on existing RoCE NIC s support for setting timers, with the RTO low value being used by default, unless explicitly extended. The bitmap manipulations in the first three modules account for most of the complexity in our synthesis. Each bitmap was implemented as a ring buffer, using an arbitrary precision variable of 128 bits, with the head corresponding to the expected sequence number at the receiver (or the cumulative acknowledgement number at the sender). The key bitmap manipulations required by IRN can be reduced to the following three categories of known operations: (i) finding first zero, to find the next expected sequence number in receivedata and the next packet to retransmit in txfree (ii) popcount to compute the increment in MSN and the number of CQEs to be expired in receivedata, (iii) bit shifts to advance the bitmap heads in receivedata and receiveack. We optimized the first two operations by dividing the bitmap variables into chunks of 32 bits and operating on these chunks in parallel. We validated the correctness of our implementation by 10
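Although the synthesis above targets an FPGA, the three bitmap primitives it names reduce to a few lines of portable code. The sketch below assumes a 128-bit bitmap stored as four 32-bit chunks, mirroring the chunking described above; the function names, the per-bit shift loop and the example in main() are our own simplifications of what the hardware does on 32-bit chunks in parallel.

```c
/* Sketch of the three bitmap primitives described above, over a 128-bit
 * bitmap stored as four 32-bit chunks. Bit i corresponds to the packet at
 * offset i from the bitmap head; names and layout are illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define CHUNKS 4   /* 4 x 32 = 128 bits, i.e. one BDP cap */

/* (i) find first zero: the next expected sequence number (receivedata) or
 * the next packet to retransmit (txfree), as an offset from the head. */
static int find_first_zero(const uint32_t bm[CHUNKS])
{
    for (int c = 0; c < CHUNKS; c++)
        if (bm[c] != UINT32_MAX)
            for (int b = 0; b < 32; b++)
                if (!((bm[c] >> b) & 1u))
                    return c * 32 + b;
    return CHUNKS * 32;   /* bitmap completely full */
}

/* (ii) popcount over the first n bits: e.g. how many packets are recorded,
 * which drives the MSN increment and the number of CQEs to expire. */
static int popcount_prefix(const uint32_t bm[CHUNKS], int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        count += (bm[i / 32] >> (i % 32)) & 1u;
    return count;
}

/* (iii) bit shift: advance the bitmap head by n positions after the expected
 * sequence number (or the cumulative ack at the sender) moves forward. */
static void shift_head(uint32_t bm[CHUNKS], int n)
{
    for (int i = 0; i < n; i++) {
        for (int c = 0; c < CHUNKS - 1; c++)
            bm[c] = (bm[c] >> 1) | (bm[c + 1] << 31);
        bm[CHUNKS - 1] >>= 1;
    }
}

int main(void)
{
    uint32_t bm[CHUNKS] = { 0 };
    bm[0] = 0x17u;   /* offsets 0,1,2 and 4 received; offset 3 still missing */

    int run = find_first_zero(bm);
    printf("in-order run at the head      : %d packets\n", run);          /* 3 */
    printf("packets recorded in bitmap    : %d\n",
           popcount_prefix(bm, CHUNKS * 32));                             /* 4 */

    shift_head(bm, run);   /* head advances past the in-order run */
    printf("first missing offset after it : %d\n", find_first_zero(bm));  /* 0 */
    return 0;
}
```

In the synthesized design the find-first-zero and popcount are evaluated on the 32-bit chunks in parallel rather than bit by bit, but the functional behavior is the same.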


More information

Congestion Control End Hosts. CSE 561 Lecture 7, Spring David Wetherall. How fast should the sender transmit data?

Congestion Control End Hosts. CSE 561 Lecture 7, Spring David Wetherall. How fast should the sender transmit data? Congestion Control End Hosts CSE 51 Lecture 7, Spring. David Wetherall Today s question How fast should the sender transmit data? Not tooslow Not toofast Just right Should not be faster than the receiver

More information

Programming Assignment 3: Transmission Control Protocol

Programming Assignment 3: Transmission Control Protocol CS 640 Introduction to Computer Networks Spring 2005 http://www.cs.wisc.edu/ suman/courses/640/s05 Programming Assignment 3: Transmission Control Protocol Assigned: March 28,2005 Due: April 15, 2005, 11:59pm

More information

TCP so far Computer Networking Outline. How Was TCP Able to Evolve

TCP so far Computer Networking Outline. How Was TCP Able to Evolve TCP so far 15-441 15-441 Computer Networking 15-641 Lecture 14: TCP Performance & Future Peter Steenkiste Fall 2016 www.cs.cmu.edu/~prs/15-441-f16 Reliable byte stream protocol Connection establishments

More information

Chapter III. congestion situation in Highspeed Networks

Chapter III. congestion situation in Highspeed Networks Chapter III Proposed model for improving the congestion situation in Highspeed Networks TCP has been the most used transport protocol for the Internet for over two decades. The scale of the Internet and

More information

CS 421: COMPUTER NETWORKS SPRING FINAL May 24, minutes. Name: Student No: TOT

CS 421: COMPUTER NETWORKS SPRING FINAL May 24, minutes. Name: Student No: TOT CS 421: COMPUTER NETWORKS SPRING 2012 FINAL May 24, 2012 150 minutes Name: Student No: Show all your work very clearly. Partial credits will only be given if you carefully state your answer with a reasonable

More information

Lecture 15: TCP over wireless networks. Mythili Vutukuru CS 653 Spring 2014 March 13, Thursday

Lecture 15: TCP over wireless networks. Mythili Vutukuru CS 653 Spring 2014 March 13, Thursday Lecture 15: TCP over wireless networks Mythili Vutukuru CS 653 Spring 2014 March 13, Thursday TCP - recap Transport layer TCP is the dominant protocol TCP provides in-order reliable byte stream abstraction

More information

CS 5520/ECE 5590NA: Network Architecture I Spring Lecture 13: UDP and TCP

CS 5520/ECE 5590NA: Network Architecture I Spring Lecture 13: UDP and TCP CS 5520/ECE 5590NA: Network Architecture I Spring 2008 Lecture 13: UDP and TCP Most recent lectures discussed mechanisms to make better use of the IP address space, Internet control messages, and layering

More information

CS268: Beyond TCP Congestion Control

CS268: Beyond TCP Congestion Control TCP Problems CS68: Beyond TCP Congestion Control Ion Stoica February 9, 004 When TCP congestion control was originally designed in 1988: - Key applications: FTP, E-mail - Maximum link bandwidth: 10Mb/s

More information

A Two-level Threshold Recovery Mechanism for SCTP

A Two-level Threshold Recovery Mechanism for SCTP A Two-level Threshold Recovery Mechanism for SCTP Armando L. Caro Jr., Janardhan R. Iyengar, Paul D. Amer, Gerard J. Heinz Computer and Information Sciences University of Delaware acaro, iyengar, amer,

More information

Randomization. Randomization used in many protocols We ll study examples:

Randomization. Randomization used in many protocols We ll study examples: Randomization Randomization used in many protocols We ll study examples: Ethernet multiple access protocol Router (de)synchronization Switch scheduling 1 Ethernet Single shared broadcast channel 2+ simultaneous

More information

Randomization used in many protocols We ll study examples: Ethernet multiple access protocol Router (de)synchronization Switch scheduling

Randomization used in many protocols We ll study examples: Ethernet multiple access protocol Router (de)synchronization Switch scheduling Randomization Randomization used in many protocols We ll study examples: Ethernet multiple access protocol Router (de)synchronization Switch scheduling 1 Ethernet Single shared broadcast channel 2+ simultaneous

More information

Outline 9.2. TCP for 2.5G/3G wireless

Outline 9.2. TCP for 2.5G/3G wireless Transport layer 9.1 Outline Motivation, TCP-mechanisms Classical approaches (Indirect TCP, Snooping TCP, Mobile TCP) PEPs in general Additional optimizations (Fast retransmit/recovery, Transmission freezing,

More information

Communication Networks

Communication Networks Communication Networks Spring 2018 Laurent Vanbever nsg.ee.ethz.ch ETH Zürich (D-ITET) April 30 2018 Materials inspired from Scott Shenker & Jennifer Rexford Last week on Communication Networks We started

More information

CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF

CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF Your Name: Answers SUNet ID: root @stanford.edu In accordance with both the letter and the

More information

Chapter 24. Transport-Layer Protocols

Chapter 24. Transport-Layer Protocols Chapter 24. Transport-Layer Protocols 23.1 Introduction 23.2 User Datagram Protocol 23.3 Transmission Control Protocol 23.4 SCTP Computer Networks 24-1 Position of Transport-Layer Protocols UDP is an unreliable

More information

Cloud e Datacenter Networking

Cloud e Datacenter Networking Cloud e Datacenter Networking Università degli Studi di Napoli Federico II Dipartimento di Ingegneria Elettrica e delle Tecnologie dell Informazione DIETI Laurea Magistrale in Ingegneria Informatica Prof.

More information

Chapter 13 TRANSPORT. Mobile Computing Winter 2005 / Overview. TCP Overview. TCP slow-start. Motivation Simple analysis Various TCP mechanisms

Chapter 13 TRANSPORT. Mobile Computing Winter 2005 / Overview. TCP Overview. TCP slow-start. Motivation Simple analysis Various TCP mechanisms Overview Chapter 13 TRANSPORT Motivation Simple analysis Various TCP mechanisms Distributed Computing Group Mobile Computing Winter 2005 / 2006 Distributed Computing Group MOBILE COMPUTING R. Wattenhofer

More information

TCP Performance. EE 122: Intro to Communication Networks. Fall 2006 (MW 4-5:30 in Donner 155) Vern Paxson TAs: Dilip Antony Joseph and Sukun Kim

TCP Performance. EE 122: Intro to Communication Networks. Fall 2006 (MW 4-5:30 in Donner 155) Vern Paxson TAs: Dilip Antony Joseph and Sukun Kim TCP Performance EE 122: Intro to Communication Networks Fall 2006 (MW 4-5:30 in Donner 155) Vern Paxson TAs: Dilip Antony Joseph and Sukun Kim http://inst.eecs.berkeley.edu/~ee122/ Materials with thanks

More information

Advanced Computer Networks. End Host Optimization

Advanced Computer Networks. End Host Optimization Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct

More information

Sequence Number. Acknowledgment Number. Data

Sequence Number. Acknowledgment Number. Data CS 455 TCP, Page 1 Transport Layer, Part II Transmission Control Protocol These slides are created by Dr. Yih Huang of George Mason University. Students registered in Dr. Huang's courses at GMU can make

More information

IIP Wireless. Presentation Outline

IIP Wireless. Presentation Outline IIP Wireless Improving Internet Protocols for Wireless Links Markku Kojo Department of Computer Science www.cs cs.helsinki.fi/research/.fi/research/iwtcp/ 1 Presentation Outline Project Project Summary

More information

Lecture 16: Wireless Networks

Lecture 16: Wireless Networks &6( *UDGXDWH1HWZRUNLQJ :LQWHU Lecture 16: Wireless Networks Geoffrey M. Voelker :LUHOHVV1HWZRUNLQJ Many topics in wireless networking Transport optimizations, ad hoc routing, MAC algorithms, QoS, mobility,

More information

6.033 Spring 2015 Lecture #11: Transport Layer Congestion Control Hari Balakrishnan Scribed by Qian Long

6.033 Spring 2015 Lecture #11: Transport Layer Congestion Control Hari Balakrishnan Scribed by Qian Long 6.033 Spring 2015 Lecture #11: Transport Layer Congestion Control Hari Balakrishnan Scribed by Qian Long Please read Chapter 19 of the 6.02 book for background, especially on acknowledgments (ACKs), timers,

More information

Congestion control in TCP

Congestion control in TCP Congestion control in TCP If the transport entities on many machines send too many packets into the network too quickly, the network will become congested, with performance degraded as packets are delayed

More information

CSE 4215/5431: Mobile Communications Winter Suprakash Datta

CSE 4215/5431: Mobile Communications Winter Suprakash Datta CSE 4215/5431: Mobile Communications Winter 2013 Suprakash Datta datta@cse.yorku.ca Office: CSEB 3043 Phone: 416-736-2100 ext 77875 Course page: http://www.cse.yorku.ca/course/4215 Some slides are adapted

More information

ECSE-6600: Internet Protocols Spring 2007, Exam 1 SOLUTIONS

ECSE-6600: Internet Protocols Spring 2007, Exam 1 SOLUTIONS ECSE-6600: Internet Protocols Spring 2007, Exam 1 SOLUTIONS Time: 75 min (strictly enforced) Points: 50 YOUR NAME (1 pt): Be brief, but DO NOT omit necessary detail {Note: Simply copying text directly

More information

Episode 5. Scheduling and Traffic Management

Episode 5. Scheduling and Traffic Management Episode 5. Scheduling and Traffic Management Part 3 Baochun Li Department of Electrical and Computer Engineering University of Toronto Outline What is scheduling? Why do we need it? Requirements of a scheduling

More information

CS 640 Introduction to Computer Networks Spring 2009

CS 640 Introduction to Computer Networks Spring 2009 CS 640 Introduction to Computer Networks Spring 2009 http://pages.cs.wisc.edu/~suman/courses/wiki/doku.php?id=640-spring2009 Programming Assignment 3: Transmission Control Protocol Assigned: March 26,

More information

CS519: Computer Networks. Lecture 5, Part 4: Mar 29, 2004 Transport: TCP congestion control

CS519: Computer Networks. Lecture 5, Part 4: Mar 29, 2004 Transport: TCP congestion control : Computer Networks Lecture 5, Part 4: Mar 29, 2004 Transport: TCP congestion control TCP performance We ve seen how TCP the protocol works Sequencing, receive window, connection setup and teardown And

More information

Transport Protocols & TCP TCP

Transport Protocols & TCP TCP Transport Protocols & TCP CSE 3213 Fall 2007 13 November 2007 1 TCP Services Flow control Connection establishment and termination Congestion control 2 1 TCP Services Transmission Control Protocol (RFC

More information

Improving the Robustness of TCP to Non-Congestion Events

Improving the Robustness of TCP to Non-Congestion Events Improving the Robustness of TCP to Non-Congestion Events Presented by : Sally Floyd floyd@acm.org For the Authors: Sumitha Bhandarkar A. L. Narasimha Reddy {sumitha,reddy}@ee.tamu.edu Problem Statement

More information

EE 122: IP Forwarding and Transport Protocols

EE 122: IP Forwarding and Transport Protocols EE 1: IP Forwarding and Transport Protocols Ion Stoica (and Brighten Godfrey) TAs: Lucian Popa, David Zats and Ganesh Ananthanarayanan http://inst.eecs.berkeley.edu/~ee1/ (Materials with thanks to Vern

More information

CMPE 257: Wireless and Mobile Networking

CMPE 257: Wireless and Mobile Networking CMPE 257: Wireless and Mobile Networking Katia Obraczka Computer Engineering UCSC Baskin Engineering Lecture 10 CMPE 257 Spring'15 1 Student Presentations Schedule May 21: Sam and Anuj May 26: Larissa

More information

CSE 461. TCP and network congestion

CSE 461. TCP and network congestion CSE 461 TCP and network congestion This Lecture Focus How should senders pace themselves to avoid stressing the network? Topics Application Presentation Session Transport Network congestion collapse Data

More information

Recap. TCP connection setup/teardown Sliding window, flow control Retransmission timeouts Fairness, max-min fairness AIMD achieves max-min fairness

Recap. TCP connection setup/teardown Sliding window, flow control Retransmission timeouts Fairness, max-min fairness AIMD achieves max-min fairness Recap TCP connection setup/teardown Sliding window, flow control Retransmission timeouts Fairness, max-min fairness AIMD achieves max-min fairness 81 Feedback Signals Several possible signals, with different

More information

6.1 Internet Transport Layer Architecture 6.2 UDP (User Datagram Protocol) 6.3 TCP (Transmission Control Protocol) 6. Transport Layer 6-1

6.1 Internet Transport Layer Architecture 6.2 UDP (User Datagram Protocol) 6.3 TCP (Transmission Control Protocol) 6. Transport Layer 6-1 6. Transport Layer 6.1 Internet Transport Layer Architecture 6.2 UDP (User Datagram Protocol) 6.3 TCP (Transmission Control Protocol) 6. Transport Layer 6-1 6.1 Internet Transport Layer Architecture The

More information

RoCE vs. iwarp Competitive Analysis

RoCE vs. iwarp Competitive Analysis WHITE PAPER February 217 RoCE vs. iwarp Competitive Analysis Executive Summary...1 RoCE s Advantages over iwarp...1 Performance and Benchmark Examples...3 Best Performance for Virtualization...5 Summary...6

More information

Congestion Control in TCP

Congestion Control in TCP Congestion Control in TCP Antonio Carzaniga Faculty of Informatics University of Lugano May 6, 2005 Outline Intro to congestion control Input rate vs. output throughput Congestion window Congestion avoidance

More information

Networking for Data Acquisition Systems. Fabrice Le Goff - 14/02/ ISOTDAQ

Networking for Data Acquisition Systems. Fabrice Le Goff - 14/02/ ISOTDAQ Networking for Data Acquisition Systems Fabrice Le Goff - 14/02/2018 - ISOTDAQ Outline Generalities The OSI Model Ethernet and Local Area Networks IP and Routing TCP, UDP and Transport Efficiency Networking

More information

Reliable Transport I: Concepts and TCP Protocol

Reliable Transport I: Concepts and TCP Protocol Reliable Transport I: Concepts and TCP Protocol Stefano Vissicchio UCL Computer Science COMP0023 Today Transport Concepts Layering context Transport goals Transport mechanisms and design choices TCP Protocol

More information

image 3.8 KB Figure 1.6: Example Web Page

image 3.8 KB Figure 1.6: Example Web Page image. KB image 1 KB Figure 1.: Example Web Page and is buffered at a router, it must wait for all previously queued packets to be transmitted first. The longer the queue (i.e., the more packets in the

More information

Transport Protocols and TCP: Review

Transport Protocols and TCP: Review Transport Protocols and TCP: Review CSE 6590 Fall 2010 Department of Computer Science & Engineering York University 1 19 September 2010 1 Connection Establishment and Termination 2 2 1 Connection Establishment

More information

Congestion / Flow Control in TCP

Congestion / Flow Control in TCP Congestion and Flow Control in 1 Flow Control and Congestion Control Flow control Sender avoids overflow of receiver buffer Congestion control All senders avoid overflow of intermediate network buffers

More information

Lecture 14: Congestion Control"

Lecture 14: Congestion Control Lecture 14: Congestion Control" CSE 222A: Computer Communication Networks Alex C. Snoeren Thanks: Amin Vahdat, Dina Katabi Lecture 14 Overview" TCP congestion control review XCP Overview 2 Congestion Control

More information

II. Principles of Computer Communications Network and Transport Layer

II. Principles of Computer Communications Network and Transport Layer II. Principles of Computer Communications Network and Transport Layer A. Internet Protocol (IP) IPv4 Header An IP datagram consists of a header part and a text part. The header has a 20-byte fixed part

More information

Performance Consequences of Partial RED Deployment

Performance Consequences of Partial RED Deployment Performance Consequences of Partial RED Deployment Brian Bowers and Nathan C. Burnett CS740 - Advanced Networks University of Wisconsin - Madison ABSTRACT The Internet is slowly adopting routers utilizing

More information

ROBUST TCP: AN IMPROVEMENT ON TCP PROTOCOL

ROBUST TCP: AN IMPROVEMENT ON TCP PROTOCOL ROBUST TCP: AN IMPROVEMENT ON TCP PROTOCOL SEIFEDDINE KADRY 1, ISSA KAMAR 1, ALI KALAKECH 2, MOHAMAD SMAILI 1 1 Lebanese University - Faculty of Science, Lebanon 1 Lebanese University - Faculty of Business,

More information

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2014

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2014 1 Congestion Control In The Internet Part 2: How it is implemented in TCP JY Le Boudec 2014 Contents 1. Congestion control in TCP 2. The fairness of TCP 3. The loss throughput formula 4. Explicit Congestion

More information

Reliable Transport II: TCP and Congestion Control

Reliable Transport II: TCP and Congestion Control Reliable Transport II: TCP and Congestion Control Stefano Vissicchio UCL Computer Science COMP0023 Recap: Last Lecture Transport Concepts Layering context Transport goals Transport mechanisms and design

More information

Baidu s Best Practice with Low Latency Networks

Baidu s Best Practice with Low Latency Networks Baidu s Best Practice with Low Latency Networks Feng Gao IEEE 802 IC NEND Orlando, FL November 2017 Presented by Huawei Low Latency Network Solutions 01 1. Background Introduction 2. Network Latency Analysis

More information

Data Link Control Protocols

Data Link Control Protocols Protocols : Introduction to Data Communications Sirindhorn International Institute of Technology Thammasat University Prepared by Steven Gordon on 23 May 2012 Y12S1L07, Steve/Courses/2012/s1/its323/lectures/datalink.tex,

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction In a packet-switched network, packets are buffered when they cannot be processed or transmitted at the rate they arrive. There are three main reasons that a router, with generic

More information

CMSC 417. Computer Networks Prof. Ashok K Agrawala Ashok Agrawala. October 11, 2018

CMSC 417. Computer Networks Prof. Ashok K Agrawala Ashok Agrawala. October 11, 2018 CMSC 417 Computer Networks Prof. Ashok K Agrawala 2018 Ashok Agrawala Message, Segment, Packet, and Frame host host HTTP HTTP message HTTP TCP TCP segment TCP router router IP IP packet IP IP packet IP

More information

Announcements. IP Forwarding & Transport Protocols. Goals of Today s Lecture. Are 32-bit Addresses Enough? Summary of IP Addressing.

Announcements. IP Forwarding & Transport Protocols. Goals of Today s Lecture. Are 32-bit Addresses Enough? Summary of IP Addressing. IP Forwarding & Transport Protocols EE 122: Intro to Communication Networks Fall 2007 (WF 4-5:30 in Cory 277) Vern Paxson TAs: Lisa Fowler, Daniel Killebrew & Jorge Ortiz http://inst.eecs.berkeley.edu/~ee122/

More information

Overview. TCP & router queuing Computer Networking. TCP details. Workloads. TCP Performance. TCP Performance. Lecture 10 TCP & Routers

Overview. TCP & router queuing Computer Networking. TCP details. Workloads. TCP Performance. TCP Performance. Lecture 10 TCP & Routers Overview 15-441 Computer Networking TCP & router queuing Lecture 10 TCP & Routers TCP details Workloads Lecture 10: 09-30-2002 2 TCP Performance TCP Performance Can TCP saturate a link? Congestion control

More information

Congestion Control. Brighten Godfrey CS 538 January Based in part on slides by Ion Stoica

Congestion Control. Brighten Godfrey CS 538 January Based in part on slides by Ion Stoica Congestion Control Brighten Godfrey CS 538 January 31 2018 Based in part on slides by Ion Stoica Announcements A starting point: the sliding window protocol TCP flow control Make sure receiving end can

More information

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2015

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2015 1 Congestion Control In The Internet Part 2: How it is implemented in TCP JY Le Boudec 2015 Contents 1. Congestion control in TCP 2. The fairness of TCP 3. The loss throughput formula 4. Explicit Congestion

More information

Network-Adaptive Video Coding and Transmission

Network-Adaptive Video Coding and Transmission Header for SPIE use Network-Adaptive Video Coding and Transmission Kay Sripanidkulchai and Tsuhan Chen Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213

More information

Mobile Transport Layer

Mobile Transport Layer Mobile Transport Layer 1 Transport Layer HTTP (used by web services) typically uses TCP Reliable transport between TCP client and server required - Stream oriented, not transaction oriented - Network friendly:

More information

Chapter 3 outline. 3.5 Connection-oriented transport: TCP. 3.6 Principles of congestion control 3.7 TCP congestion control

Chapter 3 outline. 3.5 Connection-oriented transport: TCP. 3.6 Principles of congestion control 3.7 TCP congestion control Chapter 3 outline 3.1 Transport-layer services 3.2 Multiplexing and demultiplexing 3.3 Connectionless transport: UDP 3.4 Principles of reliable data transfer 3.5 Connection-oriented transport: TCP segment

More information

Congestion. Can t sustain input rate > output rate Issues: - Avoid congestion - Control congestion - Prioritize who gets limited resources

Congestion. Can t sustain input rate > output rate Issues: - Avoid congestion - Control congestion - Prioritize who gets limited resources Congestion Source 1 Source 2 10-Mbps Ethernet 100-Mbps FDDI Router 1.5-Mbps T1 link Destination Can t sustain input rate > output rate Issues: - Avoid congestion - Control congestion - Prioritize who gets

More information

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2014

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2014 1 Congestion Control In The Internet Part 2: How it is implemented in TCP JY Le Boudec 2014 Contents 1. Congestion control in TCP 2. The fairness of TCP 3. The loss throughput formula 4. Explicit Congestion

More information

Basic Reliable Transport Protocols

Basic Reliable Transport Protocols Basic Reliable Transport Protocols Do not be alarmed by the length of this guide. There are a lot of pictures. You ve seen in lecture that most of the networks we re dealing with are best-effort : they

More information

ETSF05/ETSF10 Internet Protocols Transport Layer Protocols

ETSF05/ETSF10 Internet Protocols Transport Layer Protocols ETSF05/ETSF10 Internet Protocols Transport Layer Protocols 2016 Jens Andersson Transport Layer Communication between applications Process-to-process delivery Client/server concept Local host Normally initialiser

More information

CS321: Computer Networks Congestion Control in TCP

CS321: Computer Networks Congestion Control in TCP CS321: Computer Networks Congestion Control in TCP Dr. Manas Khatua Assistant Professor Dept. of CSE IIT Jodhpur E-mail: manaskhatua@iitj.ac.in Causes and Cost of Congestion Scenario-1: Two Senders, a

More information

Mobile Communications Chapter 9: Mobile Transport Layer

Mobile Communications Chapter 9: Mobile Transport Layer Prof. Dr.-Ing Jochen H. Schiller Inst. of Computer Science Freie Universität Berlin Germany Mobile Communications Chapter 9: Mobile Transport Layer Motivation, TCP-mechanisms Classical approaches (Indirect

More information

Request for Comments: S. Floyd ICSI K. Ramakrishnan AT&T Labs Research June 2009

Request for Comments: S. Floyd ICSI K. Ramakrishnan AT&T Labs Research June 2009 Network Working Group Request for Comments: 5562 Category: Experimental A. Kuzmanovic A. Mondal Northwestern University S. Floyd ICSI K. Ramakrishnan AT&T Labs Research June 2009 Adding Explicit Congestion

More information

CMPE150 Midterm Solutions

CMPE150 Midterm Solutions CMPE150 Midterm Solutions Question 1 Packet switching and circuit switching: (a) Is the Internet a packet switching or circuit switching network? Justify your answer. The Internet is a packet switching

More information

Computer Networking. Queue Management and Quality of Service (QOS)

Computer Networking. Queue Management and Quality of Service (QOS) Computer Networking Queue Management and Quality of Service (QOS) Outline Previously:TCP flow control Congestion sources and collapse Congestion control basics - Routers 2 Internet Pipes? How should you

More information

ECE697AA Lecture 3. Today s lecture

ECE697AA Lecture 3. Today s lecture ECE697AA Lecture 3 Transport Layer: TCP and UDP Tilman Wolf Department of Electrical and Computer Engineering 09/09/08 Today s lecture Transport layer User datagram protocol (UDP) Reliable data transfer

More information

Question. Reliable Transport: The Prequel. Don t parse my words too carefully. Don t be intimidated. Decisions and Their Principles.

Question. Reliable Transport: The Prequel. Don t parse my words too carefully. Don t be intimidated. Decisions and Their Principles. Question How many people have not yet participated? Reliable Transport: The Prequel EE122 Fall 2012 Scott Shenker http://inst.eecs.berkeley.edu/~ee122/ Materials with thanks to Jennifer Rexford, Ion Stoica,

More information

DIBS: Just-in-time congestion mitigation for Data Centers

DIBS: Just-in-time congestion mitigation for Data Centers DIBS: Just-in-time congestion mitigation for Data Centers Kyriakos Zarifis, Rui Miao, Matt Calder, Ethan Katz-Bassett, Minlan Yu, Jitendra Padhye University of Southern California Microsoft Research Summary

More information