Congestion
[Figure: Source 1 (10-Mbps Ethernet) and Source 2 (100-Mbps FDDI) feed a router whose outgoing 1.5-Mbps T1 link leads to the destination]
Can't sustain input rate > output rate
Issues:
- Avoid congestion
- Control congestion
- Prioritize who gets limited resources
Taxonomy of approaches
Router-centric vs. host-centric
- hosts at the edges of the network (transport protocol)
- routers inside the network (queuing discipline)
Reservation-based vs. feedback-based
- pre-allocate resources so as to avoid congestion
- control congestion if (and when) it occurs
Window-based vs. rate-based
Best-effort (today) vs. multiple QoS (Thursday)
Router design issues
Scheduling discipline
- Which of multiple packets should you send next?
- May want to achieve some notion of fairness
- May want some packets to have priority
Drop policy
- When should you discard a packet?
- Which packet to discard?
- Some packets more important (perhaps BGP)
- Some packets useless w/o others (cells in AAL5 CS-PDU)
- Need to balance throughput & delay
Example: FIFO tail drop
[Figure: (a) arriving packet placed in next free buffer behind queued packets, next to transmit at the head; (b) arriving packet dropped because no free buffers remain]
Differentiates packets only by when they arrive
Might not provide useful feedback for sending hosts
What to optimize for?
Fairness (in two slides)
High throughput
- queue should never be empty
Low delay
- so want short queues
Crude combination: power = Throughput/Delay
- Want to convince hosts to offer optimal load
[Figure: power (Throughput/Delay) vs. load, peaking at the optimal load]
Connectionless flows
[Figure: Sources 1-3 send through a chain of routers to Destinations 1 and 2; packets between the same endpoints form a flow]
Even in Internet, routers can have a notion of flows
- E.g., based on IP addresses & TCP ports (or hash of those)
- Soft state: doesn't have to be correct
- But since it's often correct, routers can use it to form policies
Fairness
What is fair in this situation?
- Each flow gets 1/2 link b/w? Long flow gets less?
Usually fair means equal
- For flow bandwidths (x_1, ..., x_n), fairness index:
  f(x_1, \ldots, x_n) = \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n \sum_{i=1}^{n} x_i^2}
- If all x_i's are equal, fairness is one
So what policy should routers follow?
- First, we have to understand what TCP is doing
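A minimal sketch of the fairness index above, assuming flow bandwidths are given as plain numbers:

```python
def jain_fairness(rates):
    """Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2).

    Equals 1 when all flows get equal bandwidth, and approaches
    1/n as a single flow dominates the link."""
    n = len(rates)
    total = sum(rates)
    sum_sq = sum(x * x for x in rates)
    return total * total / (n * sum_sq)

print(jain_fairness([10, 10, 10, 10]))  # 1.0  (perfectly fair)
print(jain_fairness([40, 0, 0, 0]))     # 0.25 (one flow hogs the link)
```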
TCP Congestion Control
Idea
- Assumes best-effort network
- Each source determines network capacity for itself
- Uses implicit feedback (delay, drops)
- ACKs pace transmission (self-clocking)
Challenge
- Determining the available capacity in the first place
- Adjusting to changes in the available capacity
Detecting congestion
Question: how does the source determine whether or not the network is congested?
Answer: a timeout occurs
- Timeout signals that a packet was lost
- Packets are seldom lost due to transmission error
- Lost packet implies congestion
Dealing with congestion
TCP keeps congestion & flow control windows
- Max packets in flight is lesser of two
After a packet loss, must reduce congestion window
- This brings the congestion under control
- But how much to reduce?
Idea: conservation of packets at equilibrium
- Want to keep roughly same number of packets in network
- By analogy with water in fixed-size pipe
- Put new packet into network when one exits
How much to reduce window?
Let's build a crude model of the network
- Let L_i be load of network (# pkts it contains) at time i
- If network uncongested, roughly constant: L_i = N
Now what happens under congestion?
- Some fraction γ of packets can't exit network
- So now L_i = N + γ L_{i-1}, or L_i ≈ γ^i L_0
- Congestion increases exponentially (w. infinite buffers)
Requires multiplicative decrease of window size
- TCP chooses to cut window in half
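Iterating the toy recurrence makes the point concrete; N, the γ values, and the step count below are illustrative:

```python
# Toy load model from the slide: L_i = N + gamma * L_{i-1}, where N is the
# steady uncongested load and gamma is the fraction of packets that fail
# to exit the network each step and so stay queued.
def load_trace(N, gamma, steps):
    L = N
    trace = [L]
    for _ in range(steps):
        L = N + gamma * L
        trace.append(round(L, 1))
    return trace

# The backlog compounds rapidly as gamma grows, which is why recovery
# needs a multiplicative (not additive) decrease in offered load.
for gamma in (0.5, 0.9, 0.99):
    print(gamma, load_trace(N=100, gamma=gamma, steps=8))
```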
How to use extra capacity?
Must adjust as extra capacity becomes available
- Unlike drops for congestion, no explicit signal
- Instead, try to send slightly faster, see if it works
- So need to increase window when no losses, but by how much?
Multiplicative increase?
- But easier to saturate the net than to recover (rush-hour effect)
- Multiplicative increase is so fast it will inevitably lead to saturation
Additive increase won't saturate the net
- So: Additive Increase, Multiplicative Decrease (AIMD)
Additive Increase
[Figure: source adds one packet per RTT as ACKs return from the destination]
Implementation
In practice, sending MSS-sized frames
- Let window size in bytes be w; should be a multiple of MSS
Increase:
- After w bytes ACKed, could set w ← w + MSS
- Smoother to increment window on each ACK received: w ← w + MSS * MSS/w
Decrease:
- After a packet loss, w ← w/2
- But don't want w < MSS
- So react differently to multiple consecutive losses
- Back off exponentially (pause with no packets in flight)
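A minimal sketch of the two updates above; the MSS value and the one-MSS floor on loss are the only assumptions beyond the slide:

```python
MSS = 1460  # bytes; a typical Ethernet-sized segment

def on_ack(w):
    # Additive increase: about w/MSS ACKs arrive per RTT, so adding
    # MSS*MSS/w per ACK adds roughly one MSS per RTT, but smoothly
    return w + MSS * MSS / w

def on_loss(w):
    # Multiplicative decrease: halve the window, but never below one MSS
    return max(w / 2, MSS)

# One cycle of the sawtooth: ramp up, then a loss halves the window
w = 10 * MSS
for _ in range(200):
    w = on_ack(w)
print(round(w), "->", round(on_loss(w)))
```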
AIMD trace
Window trace produces sawtooth pattern:
[Figure: congestion window (KB, 10-70) vs. time (1.0-10.0 seconds), showing the sawtooth]
Slow start
Question: where to set w initially?
- Should start at 1 MSS (to avoid overloading network)
- But additive ramp-up too slow on a fast net
Start by doubling window each RTT
- Then at most will dump one extra window into network
Slow start? This sounds like fast start!
- In contrast to what happened before Jacobson/Karels work
- Sender would dump an entire flow-control window into net
Slow start used in multiple situations
- Connection start time & after timeout
Slow start picture
[Figure: source doubles the number of packets in flight each RTT as ACKs return from the destination]
Slow start implementation
We are doubling w after each RTT
- But receiving one ACK per MSS-sized segment, i.e., w/MSS ACKs each RTT
- So can set w ← w + MSS on every ACK received
Now implementation has to keep track of three limits
- AvailableWindow for flow control
- CongestionThreshold remembers the old congestion window
- CongestionWindow smaller than threshold during slow start
Slow start only up to CongestionThreshold
- Remember last value
- When reached, go back to additive increase
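Putting slow start, the threshold, and additive increase together, a sketch of the sender state the slide names (class and attribute names are illustrative, not taken from any real TCP stack):

```python
class CongestionState:
    """Sender-side limits: CongestionWindow (cwnd), CongestionThreshold
    (ssthresh), and the flow-control AvailableWindow."""
    def __init__(self, mss=1460, available_window=64 * 1460):
        self.mss = mss
        self.cwnd = mss                           # start at 1 MSS
        self.ssthresh = available_window          # CongestionThreshold
        self.available_window = available_window  # flow-control limit

    def send_window(self):
        # Max bytes in flight is the lesser of the two windows
        return min(self.cwnd, self.available_window)

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            # Slow start: +MSS per ACK doubles the window each RTT
            self.cwnd += self.mss
        else:
            # Past the threshold: back to additive increase
            self.cwnd += self.mss * self.mss / self.cwnd

    def on_timeout(self):
        # Remember where congestion hit, then restart slow start
        self.ssthresh = max(self.cwnd / 2, 2 * self.mss)
        self.cwnd = self.mss
```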
Fast retransmit & fast recovery
Problem: coarse-grain TCP timeouts
- Have to be conservative about RTT
- Net will sit idle while waiting for a timeout
- Worse, TCP intentionally keeps bumping its head against the limit
Solution: fast retransmit
- Use 3 duplicate ACKs to trigger retransmission
- If more than one packet was lost, still need timeout
- Else, halve w, but otherwise keep sending
- No need to set w ← MSS and use slow start
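A sketch of the 3-duplicate-ACK trigger; the cumulative-ACK numbering and the resend hook are simplified assumptions. Feeding it the ACK stream from the picture on the next slide retransmits packet 3 after the third duplicate ACK 2:

```python
class FastRetransmit:
    def __init__(self, mss=1460, cwnd=16 * 1460):
        self.mss, self.cwnd = mss, cwnd
        self.last_ack, self.dup_count = -1, 0

    def resend(self, seq):
        print(f"fast retransmit of packet {seq}")

    def on_ack(self, ack_seq):
        if ack_seq == self.last_ack:
            self.dup_count += 1
            if self.dup_count == 3:
                # Three duplicate ACKs: assume the next packet was lost
                self.resend(ack_seq + 1)
                # Halve the window but keep sending; no slow start needed
                self.cwnd = max(self.cwnd / 2, self.mss)
        else:
            self.last_ack, self.dup_count = ack_seq, 0

fr = FastRetransmit()
for ack in (1, 2, 2, 2, 2, 6):   # packet 3 lost, as in the picture below
    fr.on_ack(ack)
```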
Fast retransmit picture
[Figure: sender transmits packets 1-6; packet 3 is lost, so the receiver sends ACK 1, ACK 2, then duplicate ACK 2s for packets 4-6; the duplicate ACKs trigger retransmission of packet 3, after which the receiver sends ACK 6]
Before fast retransmit
[Figure: congestion window (KB, 10-70) vs. time (1.0-9.0 seconds)]
With fast retransmit
[Figure: congestion window (KB, 10-70) vs. time (1.0-7.0 seconds)]
Congestion avoidance
TCP's strategy
- Control congestion once it happens
- Repeatedly increase load in an effort to find the point at which congestion occurs, and then back off
Alternative strategy
- Predict when congestion is about to happen
- Reduce rate before packets start being discarded
- Call this congestion avoidance, instead of congestion control
Two possibilities
- Host-centric: TCP Vegas
- Router-centric: DECbit and RED gateways
TCP Vegas
Idea: source watches for some sign that a router's queue is building up and congestion will happen, e.g., RTT grows or sending rate flattens
[Figure: three traces over 0.5-8.5 seconds: congestion window (KB, 10-70), sending rate (KBps, 100-1100), and queue size in router (0-10)]
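The slide doesn't spell out Vegas's rule; the standard formulation compares expected throughput (window over the minimum observed RTT) with actual throughput to estimate how many of our packets sit queued at the bottleneck. A sketch, with the usual α and β thresholds in packets:

```python
def vegas_adjust(cwnd, base_rtt, rtt, mss=1460, alpha=1, beta=3):
    """One window adjustment per RTT, per the standard Vegas rule
    (a sketch, not this lecture's exact code)."""
    expected = cwnd / base_rtt            # throughput if queues were empty
    actual = cwnd / rtt                   # throughput actually achieved
    # Bytes stuck in the bottleneck queue, converted to packets
    queued = (expected - actual) * base_rtt / mss
    if queued < alpha:
        return cwnd + mss   # queue nearly empty: add a packet per RTT
    if queued > beta:
        return cwnd - mss   # queue building: back off before any drop
    return cwnd
```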
TCP Vegas picture
[Figure: two traces over 0.5-6.0 seconds: congestion window (KB, 10-70) and measured throughput (KBps, 40-240)]
Fair Queuing (FQ)
Explicitly segregates traffic based on flows
Ensures no flow consumes more than its share
Variation: weighted fair queuing (WFQ)
[Figure: four per-flow queues (Flows 1-4) served round-robin onto the output link]
FQ Algorithm
Suppose clock ticks each time a bit is transmitted
Let P_i denote the length of packet i
Let S_i denote the time when we start to transmit packet i
Let F_i denote the time when we finish transmitting packet i
F_i = S_i + P_i
When does router start transmitting packet i?
- If it arrived before router finished packet i-1 from this flow, then immediately after last bit of i-1 (F_{i-1})
- If no current packets for this flow, then start transmitting when it arrives (call this A_i)
Thus: F_i = max(F_{i-1}, A_i) + P_i
FQ Algorithm (cont)
For multiple flows (see the sketch below)
- Calculate F_i for each packet that arrives on each flow
- Treat all F_i's as timestamps
- Next packet to transmit is one with lowest timestamp
Not perfect: can't preempt current packet
[Figure: (a) flow 1 packet with F = 8 and flow 2 packet with F = 10 queued; the lower timestamp transmits first. (b) flow 2 packet with F = 10 already transmitting when flow 1 packets with F = 5 and F = 2 arrive; they must wait, since the packet in progress isn't preempted]
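A sketch of the timestamp bookkeeping, using a heap to pick the lowest F_i; the virtual-clock subtleties of a real bit-by-bit simulation are glossed over, and arrival times A_i are taken as given:

```python
import heapq

class FairQueue:
    """F_i = max(F_{i-1}, A_i) + P_i, tracked per flow in bit 'ticks'."""
    def __init__(self):
        self.last_finish = {}   # flow id -> F_{i-1} for that flow
        self.queue = []         # heap of (F_i, flow id, packet length)

    def enqueue(self, flow, length, arrival):
        start = max(self.last_finish.get(flow, 0), arrival)
        finish = start + length
        self.last_finish[flow] = finish
        heapq.heappush(self.queue, (finish, flow, length))

    def dequeue(self):
        # Transmit the queued packet with the lowest timestamp F_i
        return heapq.heappop(self.queue)

fq = FairQueue()
fq.enqueue(flow=1, length=8, arrival=0)   # F = 8, as in example (a)
fq.enqueue(flow=2, length=10, arrival=0)  # F = 10
print(fq.dequeue())   # flow 1 transmits first (8 < 10)
```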
Random Early Detection (RED)
Notification is implicit
- just drop the packet (TCP will timeout)
- could make explicit by marking the packet
Early random drop
- rather than wait for queue to become full, drop each arriving packet with some drop probability whenever the queue length exceeds some drop level
RED Details
Compute average queue length:
AvgLen = (1 - Weight) * AvgLen + Weight * SampleLen
- 0 < Weight < 1 (usually 0.002)
- SampleLen is queue length each time a packet arrives
[Figure: queue with MinThreshold and MaxThreshold marks against which AvgLen is compared]
[Figure: instantaneous vs. average queue length over time]
Smooths out AvgLen over time
- Don't want to react to instantaneous fluctuations
RED Details (cont)
Two queue length thresholds:
  if AvgLen <= MinThreshold then
    enqueue the packet
  if MinThreshold < AvgLen < MaxThreshold then
    calculate probability P
    drop arriving packet with probability P
  if MaxThreshold <= AvgLen then
    drop arriving packet
RED Details (cont)
Computing probability P:
- TempP = MaxP * (AvgLen - MinThreshold)/(MaxThreshold - MinThreshold)
- P = TempP/(1 - count * TempP)
Drop probability curve:
[Figure: P(drop) vs. AvgLen; zero below MinThresh, rising linearly to MaxP at MaxThresh, then jumping to 1.0]
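Pulling the last few slides together, a sketch of RED's enqueue decision; the threshold values are illustrative, while Weight and MaxP use the commonly cited values (0.002 and 0.02, per the tuning notes below):

```python
import random

class RedQueue:
    def __init__(self, min_thresh=5, max_thresh=10, max_p=0.02, weight=0.002):
        self.min_thresh, self.max_thresh = min_thresh, max_thresh
        self.max_p, self.weight = max_p, weight
        self.avg_len = 0.0
        self.count = 0      # packets enqueued since the last drop
        self.queue = []

    def enqueue(self, pkt):
        # EWMA of queue length smooths out instantaneous fluctuations
        self.avg_len = ((1 - self.weight) * self.avg_len
                        + self.weight * len(self.queue))
        if self.avg_len <= self.min_thresh:
            pass                        # always enqueue
        elif self.avg_len < self.max_thresh:
            temp_p = self.max_p * (self.avg_len - self.min_thresh) \
                     / (self.max_thresh - self.min_thresh)
            # Spacing drops out via count (real RED clamps p as count grows)
            p = temp_p / (1 - self.count * temp_p)
            if random.random() < p:
                self.count = 0
                return False            # early random drop
        else:
            self.count = 0
            return False                # AvgLen >= MaxThreshold: always drop
        self.queue.append(pkt)
        self.count += 1
        return True
```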
Tuning RED
- Probability of dropping a particular flow's packet(s) is roughly proportional to the share of the bandwidth that flow is currently getting
- MaxP is typically set to 0.02, meaning that when the average queue size is halfway between the two thresholds, the gateway drops roughly one out of 50 packets
- If traffic is bursty, then MinThreshold should be sufficiently large to allow link utilization to be maintained at an acceptably high level
- Difference between the two thresholds should be larger than the typical increase in the calculated average queue length in one RTT; setting MaxThreshold to twice MinThreshold is reasonable for traffic on today's Internet
FPQ
Problem: tuning RED can be slightly tricky
Observations:
- TCP performs badly with window size under 4 packets: need 4 packets for 3 duplicate ACKs and fast retransmit
- Can supply feedback through delay as well as through drops
Solution: make buffer size proportional to # flows
- Few flows = low delay; many flows = low loss rate
- Router automatically adjusts, far less tricky tuning required
- Window size is a function of loss rate, so keep a minimum window size
- Transmit rate = window size / RTT, and RTT ∝ queue length
Clever algorithm estimates number of flows (sketched below)
- Hash flow info, set bits, decay
- Requires reasonable amount of storage
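The slide only gestures at the flow-estimation algorithm; here is one loose reading of "hash flow info, set bits, decay", with the bitmap size, decay policy, and estimator all illustrative assumptions:

```python
import random
import zlib

class FlowCounter:
    """Hash each arriving packet's flow tuple into a small bitmap,
    read off a rough flow count, and decay bits so dead flows age out."""
    def __init__(self, nbits=256):
        self.nbits = nbits
        self.bits = [False] * nbits

    def packet_arrived(self, flow_tuple):
        # e.g., flow_tuple = (src_ip, dst_ip, src_port, dst_port)
        h = zlib.crc32(repr(flow_tuple).encode()) % self.nbits
        self.bits[h] = True

    def estimate(self):
        # Crude: number of distinct set bits ~ number of active flows
        return sum(self.bits)

    def decay(self, fraction=0.5):
        # Periodically clear a random fraction of bits so the estimate
        # tracks currently active flows rather than all flows ever seen
        for i in range(self.nbits):
            if self.bits[i] and random.random() < fraction:
                self.bits[i] = False
```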
XCP
New proposed IP protocol: XCP
- Not compatible w. TCP; requires router support
- Idea: have routers tell us exactly what we want to know!
Packets contain: cwnd, RTT, feedback field
Router tells you whether to increase or decrease rate
- Gives explicit rates for increase/decrease amounts
- Later routers don't override bottleneck router
- Feedback returned to sender in ACKs
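A sketch of the per-packet state the slide lists; the field names, types, and the min-rule are illustrative of the idea that downstream routers only tighten, never loosen, the bottleneck's feedback (this is not the actual XCP wire format):

```python
from dataclasses import dataclass

@dataclass
class XcpHeader:
    cwnd: int        # sender's current congestion window
    rtt: float       # sender's current RTT estimate
    feedback: float  # routers write the allowed rate change here

def router_update(hdr: XcpHeader, local_feedback: float) -> None:
    # A later router must not override a more constrained bottleneck:
    # it only reduces the feedback, never raises it
    hdr.feedback = min(hdr.feedback, local_feedback)
```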