Version 2.3 (Wed, Dec 1, 2010, 1225 hours)

Contents: Notation And Abbreviations | Preliminaries | TCP Experiment 2 | TCP Experiment 1 | Remarks | How To Design A TCP Experiment

Notation And Abbreviations

KB (KiloBytes = 1,000 bytes)... Note: this is non-standard, since normally K means 1,024 when referring to size
Kb (Kilobits = 1,000 bits)
Kbps (Kilobits per second = 1,000 bits per second)
Kpps (Kilopackets per second = 1,000 packets per second)
MB, Mb, Mbps, where M stands for Mega (1,000,000)
msec (millisecond = 0.001 sec)
pkts (packets)
cwnd (TCP's congestion window size)
~= (approximately equal to)

Documents

Lab 3
Lab 3 Experiment File
Raphael's Lab 3 Notes

Configuration

                ------                  ------
                |    |     12 Mbps      |    |
    n1p2 ----> 2|    |4 --------------> 1|    |2 ----> n2p2
          q64   ------                  ------    q64
                thresh = 10 MB

Important Dependent Parameters (Approximate)

Some of these parameters are used in doing a back-of-the-envelope prediction of how the TCP flow will behave.

Assume:
- Packet length ~= 1500 bytes = 1.5 KB
- TCP New Reno with SACK
- Auto send buffer tuning is turned off, because iperf -w calls setsockopt(SO_SNDBUF), which turns off auto send buffer tuning

Bottleneck Rate = 1 Kpps
  = 12 Mbps / (12 Kb/pkt) = 1 Kpps, since each pkt is 1.5 KB, or 12,000 bits

Bottleneck Transmission Delay = 1 msec
  = length / (transmission rate) = 1 pkt / (1 Kpps) = 1 msec

Propagation Delay ~= 0 msec
  Since there is no delay plugin along the pkt path, propagation delay is negligible.

Maximum Queue Length = 20/3 Kpkts = 10 MB
  = 10 MB / (1.5 KB/pkt) = 20/3 Kpkts

Maximum Queueing Delay = 20/3 sec
  = (max queue length) / (transmission rate) = (20/3 Kpkts) / (1 Kpps) = 20/3 sec
  This is also the time it would take to completely drain a full queue at the end of the flow.

20-sec Transmission Volume = 30 MB
  The iperf transmission period is 20 sec. If the bottleneck is continuously backlogged for 20 sec and the sender transmits 1 packet for each ACK, it will transmit:
  = 1 Kpps x 20 sec = 20 Kpkts
  = 20 Kpkts x (1.5 KB/pkt) = 30 MB
  During the 20-sec iperf transmission period, the sender may transmit at a higher rate than the bottleneck until it detects a packet drop.
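
As a sanity check, here is a small Python sketch (mine, not part of the lab) that recomputes the dependent parameters above; every constant comes from the assumptions just listed, using decimal K/M as defined in the notation section.

    # Recompute the dependent parameters (decimal K/M, as defined above).
    PKT_BYTES = 1500                    # packet length ~= 1.5 KB
    PKT_BITS = PKT_BYTES * 8            # 12,000 bits = 12 Kb per packet
    LINK_BPS = 12e6                     # 12 Mbps bottleneck link
    QUEUE_BYTES = 10e6                  # 10 MB bottleneck queue threshold
    IPERF_SECS = 20                     # iperf transmission period

    pps = LINK_BPS / PKT_BITS                       # 1,000 pkts/sec = 1 Kpps
    tx_delay_ms = 1000.0 / pps                      # 1 msec per packet
    queue_pkts = QUEUE_BYTES / PKT_BYTES            # ~6,667 pkts = 20/3 Kpkts
    max_qdelay_s = queue_pkts / pps                 # 20/3 sec ~= 6.7 sec
    volume_mb = pps * IPERF_SECS * PKT_BYTES / 1e6  # 30 MB over 20 sec

    print(f"bottleneck rate    = {pps:.0f} pkts/sec")
    print(f"transmission delay = {tx_delay_ms:.1f} msec")
    print(f"max queue length   = {queue_pkts:.0f} pkts")
    print(f"max queueing delay = {max_qdelay_s:.2f} sec")
    print(f"20-sec volume      = {volume_mb:.0f} MB")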

Back-Of-The-Envelope Calculations

Assume normal TCP behavior, which means the flow goes through the following phases:
- slow-start, with cwnd doubling every RTT, ending in pkt drops
- cwnd' = cwnd/2 (the congestion window is halved)
- congestion avoidance, with fast-retransmit/fast-recovery

Round X Slow-Start Duration = 2^X msec
  Slow-start goes through N rounds, sending 2^X pkts in round X, where X starts at 0. The last pkt in round X must wait for the 2^X - 1 pkts in front of it at the bottleneck queue and then spend 1 msec in transmission. Since the bottleneck transmission delay is 1 msec, the last pkt in round X spends 2^X msec at the bottleneck. The 2-way propagation delay is 0 msec. Note that the sender is sending packets in round X while the bottleneck is transmitting packets from round X-1.

Round X Sender Transmission Rate = 2 Kpps
  = (number of pkts transmitted) / (round X-1 duration) = 2^X / (2^(X-1) msec) = 2 Kpps
  This is expected during slow-start: cwnd is incremented for each ACK, so the sender sends out 2 packets for each ACK it receives. ACKs arrive at the sender at 1 Kpps, the transmission rate of the bottleneck, so the sending rate is 2 Kpps.

Queueing Rate = 1 Kpps
  = (input rate) - (bottleneck rate) = 2 Kpps - 1 Kpps = 1 Kpps

Maximum Slow-Start Duration For 10 MB Queue ~= 6.7 sec
  The sender is continuously sending during slow-start because there is no propagation delay, so the queue grows at the 1 Kpps queueing rate and the duration of the slow-start period is:
  = 6.667 Kpkts / (1 Kpps) = 6.7 sec

Backlogged Packet Drop Rate = 0.5
  As soon as the queue is full, every other arriving packet will be dropped, since the sender is sending at 2 Kpps and the bottleneck is draining at only 1 Kpps.
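
The slow-start arithmetic can be checked the same way. This sketch assumes the idealized model used here (no propagation delay, 2 packets out per ACK in); it is back-of-the-envelope arithmetic, not a packet-level simulation.

    # Idealized slow-start against the 1 Kpps bottleneck (no propagation
    # delay; cwnd += 1 per ACK, so 2 pkts leave for every ACK that arrives).
    BOTTLENECK_PPS = 1000               # 1 Kpps
    SEND_PPS = 2 * BOTTLENECK_PPS       # 2 Kpps during slow-start
    QUEUE_CAP_PKTS = 10e6 / 1500        # ~6,667 pkts (10 MB queue)

    queueing_pps = SEND_PPS - BOTTLENECK_PPS        # 1 Kpps queue growth
    fill_time_s = QUEUE_CAP_PKTS / queueing_pps     # ~6.7 sec to fill
    drop_rate = queueing_pps / SEND_PPS             # 0.5 once the queue is full

    # Round X holds 2^X pkts; its last pkt spends ~2^X msec at the bottleneck.
    for x in range(5):
        print(f"round {x}: {2**x:2d} pkts, ~{2**x} msec at the bottleneck")
    print(f"queue full after ~{fill_time_s:.1f} sec, then drop rate {drop_rate:.1f}")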

ACK Rate = 0.38, 0.45, or 0.51 Mbps
  The ACK packet rate during slow-start is the same as the bottleneck packet rate, i.e., 1 Kpps. The minimum ACK packet has a timestamp in the TCP options portion; if there are SACK blocks, add 8 bytes for each SACK block. Since typically there is a timestamp and n SACK blocks, the ACK packet length is (48+8n) bytes, where n is 0, 1, or 2: 20 bytes of IP header, 20 bytes of TCP header, and 8+8n bytes of TCP options. So the ACK rate in Mbps will be:
  = ((ACK pkt len) / (data pkt len)) x (bottleneck transmission rate)
  = ((48+8n)/1500) x 12 Mbps, n = 0, 1, 2
  = 0.38 Mbps, 0.45 Mbps, or 0.51 Mbps for n = 0, 1, 2 respectively.

TCP Experiment 2

We cover Experiment 2 first, since the size of the sender's send buffer (10 MB = 20/3 Kpkts) allows slow-start to continue doubling cwnd until the first packet drop. The figures show most of the expected behavior:

Bandwidth
  RXBYTE 1.2 has 3 regions (plus the final drain period):
  - 425-427 (2 sec): Part 1 of slow-start, in which the sender sends at 24 Mbps (= 2 Kpps), as expected.
  - 427-431 (4 sec): Part 2 of slow-start, ending with packet drops, in which the sender sends at 18 Mbps (= 1.5 Kpps). The sender slows down during this period since the ACKs come back at half the rate they initially did. Warning: it's not clear why the ACK rate decreases during this period.
  - 431-450 (19 sec): Congestion avoidance.
  - 449-453 (4 sec): The bottleneck queue drains with no arrivals.

  RXBYTE 2.1 is a constant 12 Mbps (the bottleneck rate) and lasts 4 sec longer than the RXBYTE 1.2 chart; the bottleneck queue is draining with no arrivals during this period. The 4 sec is expected, since the maximum queue length is 6.5 MB (see Queue Length below), or about 13/3 ~= 4.3 Kpkts; since the drain rate is 1 Kpps, it should take about 4.3 sec to drain the last part of the queue.

ACK Rate
  The rate is about 0.4 Mbps for the first two seconds; it appears that there is one SACK block in the TCP options field. The rate is about 0.2 Mbps for the remainder of the flow. I'm not sure what is causing this throttling. It may be due to the receiver sending 1 ACK for every other packet received, but this wouldn't explain why the sending rate backs off from 24 Mbps to 18 Mbps for 4 sec. So there might be some resource depletion at the receiver or sender. For example, if the receiver process gets interrupted (even momentarily), its receive buffers will fill up and it will signal to the sender that it has no room to accept new packets. Warning: the only way to tell if this is the case is to run tcpdump to get a packet trace.
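
To tie the observed rates back to the derivation, here is a small sketch using only the numbers on this page. The delayed-ACK halving in the comment is the speculation above, not something confirmed without a tcpdump trace.

    # ACK bandwidth: ACKs return at the bottleneck packet rate, and each
    # ACK is (48 + 8n) bytes for n SACK blocks (20 IP + 20 TCP + 8+8n options).
    LINK_MBPS = 12
    DATA_PKT_BYTES = 1500

    for n in (0, 1, 2):
        ack_bytes = 48 + 8 * n
        rate_mbps = ack_bytes / DATA_PKT_BYTES * LINK_MBPS
        # A receiver ACKing every other segment (delayed ACKs) would halve
        # this, e.g. ~0.45 -> ~0.22 Mbps, matching the charted 0.4 -> 0.2.
        print(f"n={n}: {rate_mbps:.2f} Mbps, {rate_mbps/2:.2f} Mbps with delayed ACKs")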

Queue Length
  There are 4 regions (approximate durations):
  - 425-427 (2 sec): Part 1 of slow-start.
  - 427-431 (4 sec): Part 2 of slow-start. The decrease in slope is due to the input rate dropping from 24 Mbps (= 2 Kpps) to 18 Mbps (= 1.5 Kpps).
  - 431-450 (19 sec): ??? Congestion avoidance ??? This region is probably not congestion avoidance, since it is unlikely that there were packet drops at the router: the input rate drops to 18 Mbps (from 24 Mbps) and then to 12 Mbps (the bottleneck rate) about 6 sec from the start of the flow, which is too soon. Since the monitoring period is 0.25 sec and the bottleneck drain rate is 12 Mbps (= 1 Kpps), there is enough accuracy in the queue length chart to detect a 10 MB queue.
  - 449-453 (4 sec): The bottleneck queue drains at 1 Kpps (= 12 Mbps = 1.5 MBps) with no arrivals.

  Maximum queue length = 6.5 MB ~= 4.3 Kpkts. The queue length never reaches its maximum of 10 MB (= 6.667 Kpkts). This appears strange, but note that if there are 4.3 Kpkts in the bottleneck queue and one-half of the packets have been dropped, the sender thinks there are 8.6 Kpkts, or 12.9 MB, in flight. Since the user requested a send buffer of 10 MB (= 6.667 Kpkts) but Linux granted 20 MB (= 13.333 Kpkts), the send buffer usage has crossed the 10 MB boundary.

TCP Experiment 1

The only difference between this experiment and Experiment 2 is that the sender's buffer is limited to about 3 MB (= 2 Kpkts), and therefore cwnd will be limited to 2 Kpkts. This means you can't have more than 2 Kpkts of new packets in flight. (Actually, the sender is given 6 MB when you use "-w 3m", but some of those buffers are used for things other than unacknowledged data packets. So the sender will be able to have a cwnd between 2 Kpkts and 4 Kpkts.)

We do see that the "Queue Length" chart shows a queue length of between 3 MB and 4.5 MB. The "ACK Rate" chart is the same as in Experiment 2.

The perplexing chart is the "Bandwidth" chart. The first two seconds of the flow show 24 Mbps, which is expected. But the 24 Mbps lasts for only 2 sec, not the 6.7 sec we computed earlier for the "-w 10m" case. After that, the rate appears to oscillate, with periodic spikes that peak around 51 Mbps, 44 Mbps, 38 Mbps, 31 Mbps, 20 Mbps, and 7 Mbps.

Let's see what should happen 2 sec after the flow has started. At that point, the number of packets queued should be:
  = 2 sec x (1 Kpps) = 2 Kpkts
  = 2 Kpkts x (1.5 KB/pkt) = 3 MB
If the send buffer is 3 MB, it will be exhausted 2 sec after the start of the flow. When this occurs, cwnd cannot be incremented for each incoming ACK. Instead, it will remain constant, and only 1 packet will be transmitted for each ACK received, not 2 packets per ACK as in Experiment 2. We would expect the "RXBYTE 1.2" chart to now show 12 Mbps, but instead it shows packet bursts alternating with small idle periods.
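
A quick check of the 2-sec figure under the same idealized model (my arithmetic, not a measurement):

    # When does a 3 MB send buffer run out? With ~0 propagation delay,
    # unacknowledged data sits almost entirely in the bottleneck queue,
    # which grows at the 1 Kpps queueing rate computed earlier.
    QUEUEING_PPS = 1000
    PKT_BYTES = 1500
    SNDBUF_BYTES = 3e6                  # iperf -w 3m (Linux actually grants 6 MB)

    buf_pkts = SNDBUF_BYTES / PKT_BYTES             # 2,000 pkts = 2 Kpkts
    exhaust_s = buf_pkts / QUEUEING_PPS             # ~2 sec, as the chart shows
    print(f"{buf_pkts:.0f} pkts of send buffer, exhausted after ~{exhaust_s:.0f} sec")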

I suspect that when the send buffer space is depleted, the send() call blocks and a context switch to the scheduler occurs, causing some idle period. When the sender gets the CPU again, ACKs have freed some buffer space, and the sender bursts out packets until the buffer space is depleted again. This is not good behavior. The only way to see whether this is really happening is to run tcpdump and observe time gaps in the packet trace.

Remarks

The configuration of the two TCP exercises is poor from a TCP standpoint: it hides the most important features of TCP, slow-start and congestion avoidance. In both cases, either 1) the -w value is too small, or, equivalently, 2) the bottleneck queue capacity (threshold) is too big. Personally, I think the bottleneck queue capacity (threshold) is too big.

Most people (even networking instructors) who don't deal with TCP at a detailed level don't really have a good understanding of basic TCP behavior. Textbooks cover (at a cursory level) the case of many well-behaved, well-configured TCP flows converging on a bottleneck link. But more often than not, packet drops will occur in bursts and/or there will be resource constraints; i.e., the exception cases. A TCP flow does not behave well in the face of a large packet drop fraction when the RTT is large, because it normally depends on the fast-retransmit/fast-recovery algorithm, which retransmits one packet per RTT. (Note: the drop rate here is 1/2 because of slow-start and no propagation delay.) SACK improves the recovery, but the limited size of the TCP options field, where the SACK info is stored, prevents fast recovery!

But even though your two TCP experiments probably don't experience packet drops for the reasons stated above, the configurations cause other problems that hide the core, standard TCP behavior. The 10 MB bottleneck queue at port 1.4 aggravates TCP because the queueing delay is 6.7 sec for a full queue that drains at 12 Mbps. This leads to an RTT of 6.7 sec, which is really excessive. Using such a large queue for a single TCP flow is really crazy and will lead to bizarre TCP behavior that can only be explained by looking at a detailed packet trace.

It would help to monitor the number of packet drops at the bottleneck to get a better understanding of what is happening; see the TCP Example in the NPR Tutorial for how to do that. Tcpdump is a nice tool, but not necessary if the experiment is properly configured. If you really want to understand the minute details, running tcptrace on a tcpdump file in conjunction with the visualization tool xplot will give you a rich description of what is going on at the packet level. But these are more suitable for a graduate course in networking. However, a lecture that uses the output of these tools would be instructive.

The lab assignment doesn't use the delay plugin to emulate propagation delay. This is acceptable for a first, simple TCP lab, but introducing a delay would make it more realistic. To be realistic you would also want to use a few TCP flows. But if the configuration is properly chosen, many TCP features can still be shown using only one flow, and realism is not really necessary for an experiment in which students gain some understanding of important TCP features.

Your use of the phrase "TCP window size" when referring to the -w iperf flag is incorrect and misleading (yes, iperf uses that terminology). What window are you referring to? It can't be cwnd, the congestion window size, because that depends on various algorithms (slow-start, congestion avoidance, fast recovery). Although iperf calls the -w flag the window size, it really controls the size of the send buffer, which limits the number of packets in flight and also limits the maximum value of cwnd.
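
As an aside, the doubling mentioned earlier (ask for 10 MB, get 20 MB) is standard Linux behavior documented in socket(7), and it is easy to observe. Note that the grant is capped by net.core.wmem_max, so the exact number printed is system-dependent:

    # socket(7): Linux doubles the value passed to setsockopt(SO_SNDBUF)
    # to leave room for bookkeeping, which is why "-w 10m" shows up as 20 MB.
    # (The result is capped by net.core.wmem_max on the machine.)
    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    requested = 10_000_000              # what iperf -w 10m asks for
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
    granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    print(f"requested {requested}, kernel reports {granted}")
    s.close()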

How To Design A TCP Experiment

In designing any assignment, you should work backwards:
1. Write down the important concepts you want the students to understand.
2. Find a configuration (or configurations) that can be used to illustrate these concepts.
3. Determine the student activity that will exercise these ideas.
4. Do back-of-the-envelope calculations of key parameters to verify that your experiment will perform as expected.
5. Do the assignment to see how well it meets the desired goals.
6. Repeat the above steps until you have a good assignment.

I consider the following to be the most important TCP concepts:
- Slow-start algorithm
- Congestion detection
- Congestion avoidance algorithm
- Fast-retransmit/fast-recovery algorithm
- Self-clocking nature of TCP
- Bandwidth-delay product and high performance
- Fair sharing among multiple flows

That's a lot of concepts, so a first lab needs to focus on a few of the most important ones.

The other issue to consider is that the ONL monitoring tools have a coarse granularity of 0.25 sec or 1 sec. So parameters must be chosen such that you can see a non-zero queue length over a few seconds. For example, you could set the bottleneck queue to 30 KB (= 20 pkts) to get quick congestion detection (via packet drops), but ONL will not show the build-up of a queue unless the bottleneck rate is quite low.

Suppose we want a bottleneck rate of 12 Mbps (= 1 Kpps). If you are monitoring with a period of 0.25 sec, you want to see a non-zero queue length for at least 0.5 sec. So you will want the bottleneck queue to be sized such that (see the sizing sketch below):

  Queue Size >= 0.5 sec x (1.5 Kpps) = 750 pkts
             >= 750 pkts x (1.5 KB/pkt) = 1.125 MB

The queueing delay for this queue is 750 pkts / (1 Kpps) = 0.75 sec. The sender's buffer should be large enough to accommodate the maximum bandwidth-delay product (BDP) of 750 pkts = 1.125 MB. To be safe, choose a send buffer of at least 2.25 MB to allow for packets during fast recovery. This will allow the fast-recovery algorithm to inflate cwnd so that new packets can be injected into the network even though the sender thinks there is already a BDP's worth of packets in flight.
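
A small sketch pulling the sizing rules together, using the same assumptions as the worked numbers above (1.5 Kpps fill rate, 1 Kpps drain rate, 1.5 KB packets):

    # Size the bottleneck queue so the 0.25-sec ONL monitor can see the backlog.
    VISIBLE_S = 0.5                     # want a non-zero queue for >= 0.5 sec
    FILL_PPS = 1500                     # 1.5 Kpps, as in the notes
    DRAIN_PPS = 1000                    # 1 Kpps bottleneck
    PKT_BYTES = 1500

    queue_pkts = VISIBLE_S * FILL_PPS                # 750 pkts
    queue_mb = queue_pkts * PKT_BYTES / 1e6          # 1.125 MB
    qdelay_s = queue_pkts / DRAIN_PPS                # 0.75 sec max queueing delay
    bdp_mb = DRAIN_PPS * qdelay_s * PKT_BYTES / 1e6  # 1.125 MB BDP at that RTT
    print(f"queue >= {queue_pkts:.0f} pkts = {queue_mb:.3f} MB")
    print(f"max queueing delay = {qdelay_s:.2f} sec, BDP = {bdp_mb:.3f} MB")
    print(f"send buffer >= {2 * bdp_mb:.2f} MB (2 x BDP, for fast recovery)")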