1 Congestion Control In The Internet Part 2: How it is implemented in TCP JY Le Boudec 2015
Contents 1. Congestion control in TCP 2. The fairness of TCP 3. The loss throughput formula 4. Explicit Congestion Notification (ECN) and RED 5. Other Bells and Whistles 2
1. Congestion Control in the Internet is in TCP TCP is used to avoid congestion in the Internet in addition to what was shown: a TCP source adjusts its window to the congestion status of the Internet (slow start, congestion avoidance) this avoids congestion collapse and ensures some fairness TCP sources interprets losses as a negative feedback use to reduce the sending rate UDP sources are a problem for the Internet use for long lived sessions (ex: RealAudio) is a threat: congestion collapse UDP sources should imitate TCP : TCP friendly 3
TCP Congestion Control is based on AIMD TCP adjusts the window size (in addition to offered window ie credit mechanism) based on the belief that rate W = min (cwnd, OfferedWindow) OfferedWindow = window obtained by TCP s window field cwnd = controlled by TCP congestion control Principles of TCP Congestion Control negative feedback = loss, positive feedback = ACK received Additive Increase (+1 Maximum Segment Size MSS), Multiplicative Decrease ( 0.5) Slow start with increase factor 2 In addition to AIMD: when loss is detected by timeout > slow start Loss detected by fast retransmit => fast recovery (no slow start) 4
A Trace Bytes 60 30 twnd (= ssthresh) A B C cwnd 0 0 1 2 3 4 5 6 7 8 9 seconds created from data from: IEEE Transactions on Networking, Oct. 95, TCP Vegas, L. Brakmo and L. Petersen 5
Bytes 60 30 twnd cwnd A Trace A B C 0 0 1 2 3 4 5 6 7 8 9 seconds B congestion avoidance slow start C congestion avoidance created from data from: IEEE Transactions on Networking, Oct. 95, TCP Vegas, L. Brakmo and L. Petersen 6
7 How TCP approximates multiplicative decrease : (on loss detection by timeout) cwnd = 0.5 current_window cwnd = max (cwnd, 2 MSS) additive increase : for each ACK received cwnd = cwnd + MSS MSS / cwnd This is equivalent in packets to cwnd cwnd 1 cwnd cwnd = min (cwnd, max size) (64KB) cwnd = 1 MSS 2 3 4
How TCP approximates multiplicative increase : (Slow Start) non dupl. ack received during slow start > cwnd = cwnd + MSS (in bytes) (1) if cwnd = twnd then go to congestion avoidance cwnd = 1 seg 2 3 4 5 6 7 8 (1) is equivalent in packets to cwnd = cwnd + 1 (in packets) 8
9 AIMD and Slow Start window size loss loss slow start
10 A finite state machine description (incomplete) connection opening: twnd = 65535 B cwnd = 1 seg Slow Start exponential increase for cwnd until cwnd = twnd retransmission timeout: - multiplicative decrease for twnd - cwnd = 1 seg cwnd = twnd Congestion Avoidance additive increase for twnd, cwnd = twnd retransmission timeout: - multiplicative decrease for twnd - cwnd = 1 seg notes this shows only 2 states out of 3 twnd = target window
Assume a TCP flow uses WiFi with high loss ratio. Assume some packets are lost in spite of WiFi retransmissions. When a packet is lost on the WiFi link 1. The TCP source knows it is a loss due to channel errors and not congestion, therefore does not reduce the window 2. The TCP source thinks it is a congestion loss and reduces its window 3. It depends if the MAC layer uses retransmissions 4. I don t know 0% 0% 0% 0% 1. 2. 3. 4. 11
Fast Recovery Slow start used when we assume that the network condition is new or abruptly changing i.e. at beginning and after loss detected by timeout In all other packet loss detection events, slow start is not used, but fast recovery is used instead With Fast Recovery target window is halved But congestion window is allowed to increase beyond the target window until the loss is repaired 12
Fast Recovery Details When loss is detected by 3 duplicate acks twnd = 0.5 current-size twnd = max (twnd, 2 MSS) cwnd = twnd + 3 MSS (exp. increase) cwnd = min (cwnd, 64K) For each duplicated ACK received cwnd = cwnd + MSS (exp. increase) cwnd = min (cwnd, 64K) When loss is repaired cwnd = twnd Goto congestion avoidance 13
Fast Recovery Example TcpMaxDupACKs=3 twnd=cwnd = 800 seq=201:301 seq=301:351 seq=351:401 seq=401:501 twnd=cwnd=813 seq=501:601 seq=601:701 seq=701:751 twnd=407, cwnd=707 seq=201:301 seq=751:801 twnd=407, cwnd=807 twnd=407, cwnd=907 seq=801:901 twnd=407, cwnd=1007 twnd=407, cwnd=407 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Ack = 101,win=1 000 Ack = 101,win=1 000 Ack = 101,win=1 000 Ack = 101,win=1 000 Ack = 101,win=1 000 Ack = 101,win=1 000 Ack = 101,win=1 000 Ack = 801,win=1 000 14
15 At time 1, the sender is in congestion avoidance mode. The congestion window increases with every received non duplicate ack (as at time 6). The target window (also called slowstart threshold) is equal to the congestion window. The second packet is lost. At time 12, its loss is detected by fast retransmit, i.e. reception of 3 duplicate acks. The sender goes into fast recovery mode. The target window is set to half the value of the congestion window; the congestion window is set to the target window plus 3 packets (one for each duplicate ack received). At time 13 the source retransmits the lost packet. At time 14 it transmits a fresh packet. This is possible because the window is large enough. The window size, which is the minimum of the congestion window and the advertised window, is equal to 707. Since the last acked byte is 101, it is possible to send up to 807. At times 15, 16 and 18, the congestion window is increased by 1 MMS, i.e. 100 bytes, by application of the congestion avoidance algorithm. This allows to send one fresh packet at time 17. At time 18 the lost packet is acked, the source exits the fast recovery mode and enters congestion avoidance. The congestion window is set to the target window.
How many new segments of size 100 bytes can the source send at time 20? A. 1 B. 2 C. 3 D. 4 F. 0 G. I don t know 1 0% 0% 0% 0% 0% 0% 0% 2 3 4 0 I don t know 16
18 Bytes 60 twnd Fast Recovery Example A B C D E F 30 0 0 1 2 3 4 5 6 cwnd E F seconds A-B, E-F: fast recovery C-D: slow start
19 A finite state machine description new connection: Slow Start - exponential increase cwnd = twnd: retr. timeout: Congestion Avoidance - additive increase retr. timeout: retr. timeout: fast retransmit: Fast Recovery - exponential increase beyond twnd fast retransmit: expected ack received:
2. Fairness of TCP TCP implements some approximation of AIMD AIMD is designed to be approximately fair in a single link scenario Q: what happens in a network? Does TCP distribute rates according to max min fairness or proportional fairness? 20
Does TCP distribute rates according to maxmin fairness or proportional fairness? 1. Max min 2. Proportional 3. None of the above 4. I don t know 25% 25% 25% 25% 1. 2. 3. 4. 21
23 TCP and RTT TCP tends to distribute rate so as to maximize utility of source given by The utility depends on the roundtrip time ; The utility U is a decreasing function of What does this imply?
24 and send to destination using one TCP connection each, RTTs are 60ms and 140ms. Bottleneck is link «router destination». Who gets more? gets a higher throughput gets a higher throughput 3. Both get the same 4. I don t know 0% 0% 0% 0% 1. 2. 3. 4. S 1 router destination 10 Mb/s, 20 ms 1 Mb/s 10 ms 10 Mb/s, 60 ms S 2
The Bias of TCP Against Large RTTs With TCP, two competing sources with different RTTs are not treated equally source with large RTT obtains less A source that uses many hops obtains less rate because of two combined factors, one is good, the other is bad: 1. this source uses more resources. The mechanics of proportional fairness leads to this source having less rate this is desirable in view of the theory of fairness. 2. this source has a larger RTT. The mechanics of additive increase leads to this source having less rate this is a bug in the design of TCP Cause is : additive increase is one packet per RTT (instead of one packet per constant time interval) 26
3. TCP Loss Throughput Formula Consider a large TCP connection (many bytes to transmit) Assume we observe that, in average, a fraction q of packets is lost (or marked with ECN) Can we say something about the expected throughput for this connection? 27
TCP Loss Throughput Formula TCP connection throughput (bytes per second) RTT T (seconds) segment size L (bytes) average packet loss ratio 0,1 constant C = 1.22 Assumes: transmission time negligible compared to RTT, losses are rare, time spent in Slow Start and Fast Recovery negligible 28
Guess the ratio between the throughputs 1 and 2 of S 1 and S 2 S 1 router destination 10 Mb/s, 20 ms 1 Mb/s 10 ms 10 Mb/s, 60 ms S 2 5. None of the above 6. I don t know 0% 0% 0% 0% 0% 0% 1. 2. 3. 4. 5. 6. 29
4. Explicit Congestion Notification (ECN) Why invented: signal congestion without dropping packets dropping packets is not nice (delay, retransmissions) (= DECbit) How does it work: router marks packet instead of dropping TCP destination echoes the mark back to source At the source, TCP interprets a marked packet in the same way as if there would be a loss detected by fast retransmit 31
32 Explicit Congestion Notification (ECN) payload TCP IP header header S R S reduces window by 1/2
ECN Flags 2 bits in IP header (direct path) code for 4 possible codepoints non ECN Capable (non ECT) ECN capable and no congestion ECT(0) and ECT(1) ECN capable and congestion experienced (CE) ECT(0) CE CE ECT(0) ECE=1 ECT(0) ECE=1 3 bits in TCP header (return path) ECE (ECN echo) bit = 0 or 1 CWR = ack of ECE=1 received (window reduced) ECE = 1 is set by R until R receives a TCP segment with CWR=1 NS = nonce, used for S to check that ECN is taken seriously. 33
Put correct labels assume TCP with ECN is used and there is no packet loss CA :congestion avoidance=additive increase SS : slow start = multiplicative increase MD : multiplicative decrease 1. 1= CA, 2 = SS 2. 1 = SS, 2 = MD 3. 1 = CA, 2 = MD 4. I don t know 25% 25% 25% 25% 1. 2. 3. 4. 34
36 Nonce Sum (RFC 3540) ECT(0) CE CE ECT(0) ECE=1 ECT(0) ECE=1 Why invented? S uses ECN routers do not drop packets, use ECN instead but ECN could be prevented by: R not implementing ECN, NATs that drop ECN info in header In such cases, the flow of S is not congestion controlled; this should be detected > revert to non ECN What does it do? Nonce Sum bit in TCP header (NS) is used by S to verify that ECN works for this TCP
Nonce Sum ECT(0) CE CE (RFC 3540) ECT(0) ECE=1 How does it work? ECT(0) ECE=1 Source S randomly chooses ECT(0) or ECT(1) in IP header and verifies that the received NS bit is compatible with the ECT(0)/ECT(1) chosen by S Non congested router does nothing; congested router sees ECT(0) or ECT(1) and marks packet as CE instead of dropping it R echoes back to S the xor of all ECT bits in NS field of TCP header; If R does not take ECN seriously, NS does not correspond and S detects it; S detects that ECN does not work; Malicious R cannot compute NS correctly because CE packets do not carry ECT bit If router or NAT drops ECN bits then R cannot compute the correct NS bit and S detects that ECN does not work 37
From RFC 3540 38
39 RED (Random Early Detection) Problem solved by RED = when to mark a packet with ECN Principle: queue estimates its average queue length avg a measured + (1 - a) avg incoming packet is marked with probability given by RED curve a uniformization procedure is also applied to prevent bursts of marking q (marking probability) 1 max p th min th max avg (queue size)
Active Queue Management RED can also be applied even if ECN is not supported In such a case, a packet may be dropped even if there is some space available Expected benefit avoid constantly full buffers avoid irregular drop patterns This is called Active Queue Management as opposed to passive queue management = drop a packet when queue is full = Tail Drop 40
In a network where all flows use TCP with ECN and all routers support ECN, we expect that 1. there is no packet loss 2. there is no packet loss due to congestion 3. there is no packet loss due to congestion in routers 4. none of the above 5. I don t know 0% 0% 0% 0% 0% 1. 2. 3. 4. 5. 41
5. Other Bells and Whistles TCP for Very High Speed Networks (slide from Presentation: "Congestion Control on High Speed Networks, Injong Rhee, Lisong Xu, Slide 7) the figure assumes losses are detected by fast retransmit and ignores the short fast recovery phase 1.4 hours 1.4 hours 1.4 hours cwnd Packet loss Packet loss Packet loss Packet loss 100,000 10Gbps TCP 50,000 5Gbps Slow Increase Fast Decrease cwnd = cwnd + cwnd = cwnd * 1 0.5 Slow start Congestion avoidance Time (RTT)
CUBIC, FastTCP etc. Why invented? On long fat networks (very large bandwidth delay product), AIMD is too slow Solution is to increase more quickly than AIMD But only in conditions where one TCP connection can obtain a very large throughput Cubic is one such modification of TCP Behaves like TCP when RTT is less than 100msec, otherwise is more aggressive Implemented in Linux servers 43
TCP Friendly Applications UDP applications can be a threat to network performance if they generate a lot of traffic; Some UDP applications send at constant rate (e.g. a high frequency sensor); care should be done in such networks to avoid congestion collapse (control the rate and the capacity of the network) Some other UDP applications can adapt their rate (e.g. VBR video); they imitate TCP and are called TCP friendly: application determines the sending rate hence coding rate feedback ir received in form of count of lost packets sending rate is set to: rate of a TCP flow experiencing the same loss ratio, using the loss throughput formula See for example the TFRC (TCP Friendly Rate Control) protocol 44
Per Flow Queuing In general, all flows compete in the Internet using the congestion control method of TCP In special cases, it is possible to modify the competition by using per flow queuing Example 1: 2 Class Priority Queuing routers classify flows (using IP addresses and perhaps port numbers) into 2 classes class 1 has priority over class 2 Sensor PMU1 Sensor PMU2 class1 high prio S1 PC1 class 2 low prio PC2 45
Example: Priority Queuing PMU1 and PMU2 use UDP at 1 Mb/s each PC1 and PC 2 used TCP at highest possible rate PMU1 and PMU2 have highest priority: they see network as: PC1 and PC2 have lowest priority; they see network as: 46
Which rates will PC1 and PC2 achieve? Sensor PMU1 Sensor PMU2 UDP at 1 Mb/s UDP at 1 Mb/s class1 10 Mb/s 10 Mb/s high prio 10 Mb/s S1 PC1 1 TCP connection RTT= 100 msec PC2 class 2 low prio 1 TCP connection RTT = 100 msec 1. 5 Mb/s each 2. 4 Mb/s each 3. PC1: 5 Mb/s, PC2: 3 Mb/s 4. I don t know 0% 0% 0% 0% 1. 2. 3. 4. 47
Example: Class Bass Queuing Potential problem with priority queuing: high priority flows may starve low priority flows example: PMU1 has a bug and sends at 10 Mb/s PC1 and PC2 receive close to 0 Mb/s Class Based Queuing was created to solve that problem, i.e. isolate low priority from errors/attacks on high priority Class Based Queuing works as follows: routers classify flows (using IP addresses and perhaps port numbers) each class is guaranteed a maximum rate some classes may exceed the maximum rate by borrowing from other classes 49
Example of Class Based Queuing Sensor PMU1 UDP at 1 Mb/s Sensor PMU2 UDP at 1 Mb/s class 1 rate = 2.5 Mb/s S1 10 Mb/s 10 Mb/s 10 Mb/s PC1 1 TCP connection RTT= 100 msec PC2 1 TCP connection RTT = 100 msec class 2 rate = 7.5 Mb/s, borrow from total 10 Mb/s Class 1 is guaranteed a rate of 2.5 Mb/s; cannot exceed this rate Class 2 is guaranteed a rate of 7.5 Mb/s; can exceed this rate by borrowing capacity available from the total 10 Mb/s 50
Which rates will PC1 and PC2 achieve? 1. 5 Mb/s each 2. 4 Mb/s each 3. PC1: 5 Mb/s, PC2: 3 Mb/s 4. I don t know 0% 0% 0% 0% 1. 2. 3. 4. 51
Facts to remember TCP performs congestion control in end systems sender increases its sending window until loss occurs, then decreases additive increase (no loss) multiplicative decrease (loss) Negative feedback is either packet drop or ECN Negative bias towards long round trip times is partly undue UDP applications should behave like TCP with the same loss rate or should be isolated e.g. by class based queuing. 53