Islamic University of Gaza
Faculty of Engineering
Department of Computer Engineering
ECOM 4021: Networks Discussion

Chapter 5 - Part 2
End to End Protocols

Eng. Haneen El-Masry
May, 2014
Transport Layer

The transport layer turns the host-to-host packet delivery service of the underlying network into a process-to-process communication channel.

Common properties that application processes expect a transport protocol to provide:
• Guarantees message delivery.
• Delivers messages in the same order they were sent.
• Delivers at most one copy of each message.
• Supports arbitrarily large messages.
• Supports multiple application processes on each host.

Typical limitations of the network on which the transport protocol operates. The network may:
• Drop messages.
• Reorder messages.
• Deliver duplicate copies of a given message.
• Limit messages to some finite size.
• Deliver messages after an arbitrarily long delay.

The challenge for transport protocols is to develop algorithms that turn the less-than-desirable properties of the underlying network into the service required by application programs.

User Datagram Protocol (UDP)
• UDP simply extends the host-to-host delivery service of IP into a process-to-process communication service.
• UDP adds a level of demultiplexing, which allows multiple application processes on each host to share the network.
• UDP adds no other functionality to the best-effort IP service.
• UDP provides an unreliable, connectionless service.
• An application process is identified by a <port, host> pair.
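The demultiplexing step can be pictured with a small toy model. This is only a sketch, not how any real operating system implements UDP: the `Host` class, its `bind` and `deliver` methods, and the port numbers are hypothetical names chosen for illustration, and the model simply drops datagrams sent to an unbound port.

```python
from collections import deque


class Host:
    """Toy model of UDP demultiplexing on one host (illustrative only)."""

    def __init__(self):
        # One message queue per bound port.
        self.queues = {}

    def bind(self, port):
        """Attach a process to a port and return its message queue."""
        if port in self.queues:
            raise ValueError(f"port {port} already in use")
        self.queues[port] = deque()
        return self.queues[port]

    def deliver(self, dst_port, payload):
        """Hand an arriving datagram to the right queue; False if dropped."""
        queue = self.queues.get(dst_port)
        if queue is None:
            return False  # assumption: no process on this port, so drop
        queue.append(payload)
        return True


host = Host()
dns_queue = host.bind(53)
app_queue = host.bind(5000)
host.deliver(53, b"query")
host.deliver(5000, b"hello")
dropped = not host.deliver(9999, b"nobody home")
```

Each bound port gets its own queue, so two processes on the same host receive only the datagrams addressed to their own port.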
UDP datagram format
• SrcPort/DestPort: the port of the source/destination process.
• Length: number of bytes in the UDP datagram, including the header and the data.
• Checksum: computed over the entire UDP datagram and the pseudoheader.
  o The pseudoheader consists of the source IP address, destination IP address, and protocol number from the IP header, plus the UDP length field.
  o The pseudoheader is used to verify that the datagram has been delivered between the correct two endpoints.
  o The UDP checksum is optional in IPv4 and mandatory in IPv6.
  o The UDP checksum field is set to zero if the checksum is not used.

The Transmission Control Protocol (TCP)

TCP offers a reliable, connection-oriented, byte-stream service:
• Reliable, in-order delivery of a stream of bytes.
• Full-duplex operation.
• Includes a flow control mechanism that keeps the sender from overrunning the receiver.
• Implements a congestion control mechanism that keeps the sender from overloading the network.

TCP uses the sliding window algorithm on an end-to-end basis to provide reliable and ordered delivery. However, because TCP runs over the Internet rather than a point-to-point link, there are many important differences.

End-to-end issues
• TCP supports logical connections between processes running on any two computers in the Internet.
  o Need explicit connection establishment and teardown.
• TCP connections may have widely different RTTs, and the RTT may vary during a single TCP connection.
  o Need an adaptive timeout mechanism.
• Potentially long delay in the network.
  o Need to be prepared for very old packets to suddenly show up at the receiver, potentially confusing the sliding window algorithm.
• Potentially different capacity at the destination host.
  o Each side needs to learn how much buffer space the other side can allocate to the connection (i.e., flow control).
• The network is shared by many hosts.
  o Need to be prepared for network congestion.

TCP is a byte-oriented protocol: the sender writes bytes into a TCP connection and the receiver reads bytes out of the TCP connection.
• TCP on the source host buffers enough bytes from the sending process to fill a reasonably sized packet and then sends this packet to its peer on the destination host.
• TCP on the destination host then empties the contents of the packet into a receive buffer, and the receiving process reads from this buffer at its leisure.
• The packets exchanged between TCP peers are called segments.

TCP segment format
• SrcPort/DstPort: identify the source/destination port.
  o A TCP connection is uniquely identified by the 4-tuple <SrcPort, SrcIPAddr, DstPort, DstIPAddr>.
• SequenceNum: the sequence number of the first byte of data carried in the segment.
• Acknowledgement: the next sequence number expected.
• AdvertisedWindow: number of bytes, beginning with the sequence number indicated in the Acknowledgement field, that the receiver is able to accept.
• HdrLen: length of the header in 32-bit words.
• Flags:
  o SYN: used in connection establishment.
  o FIN: used in connection termination.
  o RESET: used when one side wants to abort the connection.
  o ACK: set when the Acknowledgement field is valid.
  o URG: indicates that this segment contains urgent data. Urgent data is contained at the front of the segment body, before the non-urgent data, and UrgPtr indicates the number of bytes of urgent data.
  o PUSH: indicates that the sending process wants TCP to send whatever bytes it has collected to its peer.
• Checksum: computed over the entire TCP segment and the pseudoheader.
  o The pseudoheader consists of the source IP address, destination IP address, and protocol fields from the IP header, plus a TCP length field (length of the TCP header and data, measured in bytes).
  o Required in both IPv4 and IPv6.
• Options: up to 40 bytes, attached after the mandatory fields.
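To make the field layout concrete, the fixed 20-byte part of the header can be unpacked with Python's standard `struct` module. This is a minimal sketch that assumes the standard field order and widths; the function name `parse_tcp_header` and the dictionary keys are hypothetical, the options area is ignored, and the checksum is not verified.

```python
import struct

# Bit positions of the six classic flags in the low bits of the 16-bit
# "header length + flags" word.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RESET": 0x04,
             "PUSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the mandatory 20 bytes of a TCP header (options are skipped)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    # "!" selects network (big-endian) byte order.
    (src_port, dst_port, seq_num, ack_num,
     hdrlen_flags, window, checksum, urg_ptr) = struct.unpack(
        "!HHIIHHHH", segment[:20])
    hdr_len_words = hdrlen_flags >> 12  # HdrLen counts 32-bit words
    return {
        "SrcPort": src_port,
        "DstPort": dst_port,
        "SequenceNum": seq_num,
        "Acknowledgement": ack_num,
        "HdrLenBytes": hdr_len_words * 4,
        "Flags": {name for name, bit in FLAG_BITS.items() if hdrlen_flags & bit},
        "AdvertisedWindow": window,
        "Checksum": checksum,
        "UrgPtr": urg_ptr,
    }


# Hand-built SYN segment header: HdrLen = 5 words, only the SYN flag set.
syn = struct.pack("!HHIIHHHH", 12345, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
fields = parse_tcp_header(syn)
```

With HdrLen equal to 5 the header is exactly 20 bytes; any value above 5 means options follow the mandatory fields.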
Connection Establishment
• Before a client attempts to connect with a server, the server must first bind to and listen at a port: this is called a passive open.
• Once the passive open is established, a client may initiate an active open.
• The three-way handshake occurs during connection establishment: the client sends a SYN, the server replies with a SYN+ACK, and the client completes the exchange with an ACK.
  o Each side selects an initial sequence number at random.
  o A timer is scheduled for the SYN and SYN+ACK segments so that they can be retransmitted upon timeout.

Connection termination
• Each side independently closes its half of the connection by sending a FIN segment.
• If one side closes the connection, it can no longer send data, but it can still receive data from the other side.
• A timer is scheduled for the FIN segment; the FIN segment is retransmitted upon timeout.
• The TIME_WAIT state is used to prevent confusion caused by a delayed duplicate FIN segment from the other side being delivered during a subsequent connection.
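The sequence and acknowledgement numbers exchanged in the three-way handshake described under Connection Establishment can be traced with a short sketch. The function name `three_way_handshake` and the dictionary layout are hypothetical; timers, retransmission and state transitions are left out. The only rule modelled is that a SYN consumes one sequence number, so each side acknowledges the other's initial sequence number plus one.

```python
import random

SEQ_MODULUS = 2 ** 32  # sequence numbers are 32 bits wide


def three_way_handshake(client_isn=None, server_isn=None):
    """Return the three segments of a TCP connection setup as dicts."""
    # Each side picks a random initial sequence number unless one is given.
    x = random.getrandbits(32) if client_isn is None else client_isn
    y = random.getrandbits(32) if server_isn is None else server_isn
    return [
        # 1. Active open: the client sends SYN with its sequence number x.
        {"from": "client", "flags": {"SYN"}, "seq": x},
        # 2. The server answers with its own SYN (y) and acknowledges x + 1.
        {"from": "server", "flags": {"SYN", "ACK"}, "seq": y,
         "ack": (x + 1) % SEQ_MODULUS},
        # 3. The client acknowledges y + 1; the connection is established.
        {"from": "client", "flags": {"ACK"}, "seq": (x + 1) % SEQ_MODULUS,
         "ack": (y + 1) % SEQ_MODULUS},
    ]


segments = three_way_handshake(client_isn=100, server_isn=300)
```

The arithmetic is done modulo 2^32 because an initial sequence number may be chosen close to the top of the sequence space.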
TCP's Sliding Window Algorithm

Provides reliable delivery, in-order delivery, and flow control.

Reliable and ordered delivery
• The send buffer stores data that has been sent but not yet ACKed, as well as data that has been written by the sending application but not yet transmitted.
• The receive buffer holds data that arrives out of order, as well as data that is in the correct order but that the application process has not yet had the chance to read.
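The bookkeeping for these two buffers can be sketched with a handful of byte counters. The pointer names (LastByteAcked, LastByteSent, LastByteWritten, LastByteRead, NextByteExpected, LastByteRcvd) and the window formulas are the ones this handout uses in its flow control discussion; the class names, method names and example numbers are assumptions made for illustration, and no actual data is stored.

```python
from dataclasses import dataclass


@dataclass
class SendBuffer:
    max_send_buffer: int
    last_byte_acked: int = 0
    last_byte_sent: int = 0
    last_byte_written: int = 0

    def unacked_bytes(self) -> int:
        # Data that has been sent but not yet ACKed.
        return self.last_byte_sent - self.last_byte_acked

    def unsent_bytes(self) -> int:
        # Data written by the application but not yet transmitted.
        return self.last_byte_written - self.last_byte_sent

    def effective_window(self, advertised_window: int) -> int:
        # EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)
        return advertised_window - self.unacked_bytes()

    def write_would_block(self, y: int) -> bool:
        # TCP blocks the writer if the send buffer would overflow.
        return (self.last_byte_written - self.last_byte_acked) + y > self.max_send_buffer


@dataclass
class ReceiveBuffer:
    max_rcv_buffer: int
    last_byte_read: int = 0
    next_byte_expected: int = 1
    last_byte_rcvd: int = 0

    def readable_bytes(self) -> int:
        # In-order data the application has not read yet.
        return (self.next_byte_expected - 1) - self.last_byte_read

    def advertised_window(self) -> int:
        # AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead)
        return self.max_rcv_buffer - self.readable_bytes()


rcv = ReceiveBuffer(max_rcv_buffer=4096, last_byte_read=1000,
                    next_byte_expected=3001, last_byte_rcvd=3000)
snd = SendBuffer(max_send_buffer=8192, last_byte_acked=3000,
                 last_byte_sent=3500, last_byte_written=5000)
window = rcv.advertised_window()     # 4096 - 2000 = 2096
room = snd.effective_window(window)  # 2096 - 500 = 1596
```

In this example the receiver still holds 2000 unread bytes, so it advertises 2096 bytes; the sender already has 500 bytes in flight and may therefore send 1596 more.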
Flow control
• MaxRcvBuffer denotes the size of the receive buffer.
• MaxSendBuffer denotes the size of the send buffer.
• The receive side must keep
      LastByteRcvd - LastByteRead ≤ MaxRcvBuffer
  to avoid overflowing its buffer.
• The receiver advertises a window size of:
      AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead)
  o AdvertisedWindow indicates the amount of free space remaining in the receive buffer.
• The sender must ensure that the number of outstanding bytes is no larger than AdvertisedWindow, that is:
      LastByteSent - LastByteAcked ≤ AdvertisedWindow
• The sender computes an effective window that limits how much data it can send:
      EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)
  o If EffectiveWindow > 0, the sender can send more data.
• The sender must ensure that the application process does not overflow the send buffer, i.e.,
      LastByteWritten - LastByteAcked ≤ MaxSendBuffer
  o TCP blocks the sending process if (LastByteWritten - LastByteAcked) + y > MaxSendBuffer, where y is the number of bytes the sending process tries to write to TCP.
• When AdvertisedWindow = 0:
  o The sending side periodically sends a probe segment with one byte of data.
  o Each probe segment triggers a response that contains the current advertised window.

Protecting against wraparound
• TCP satisfies the requirement that the sequence number space be twice as big as the window size (2^32 >> 2 × 2^16).
• TCP also needs to make sure the sequence number does not wrap around within the Maximum Segment Lifetime (MSL = 120 seconds).
• The time until wraparound depends on how fast data can be transmitted over the network.
• TCP uses the 32-bit timestamp option to effectively extend the sequence number space.
  o TCP reads the system clock when it is about to send a segment, and puts this time in the segment's header.
  o TCP accepts or rejects a segment based on a 64-bit identifier that has the SequenceNum field in the low-order 32 bits and the timestamp in the high-order 32 bits.
  o The timestamp serves to distinguish between two different incarnations of the same sequence number.

Keeping the pipe full
• The advertised window needs to allow a full RTT × bandwidth product's worth of data to be transmitted (i.e., keep the pipe full).
• The 16-bit AdvertisedWindow field allows the receiver to advertise a window of only 64 KB, which is not big enough for high-speed networks.
• TCP uses the window scale option to effectively increase the size of the advertised window.
  o The option defines a scaling factor for the advertised window that allows the two sides to agree that the AdvertisedWindow field counts larger chunks (e.g., 16-byte units) of data the sender can have unACKed.
  o In other words, the option specifies how many bits each side should left-shift the AdvertisedWindow field before using its contents to compute an effective window.
  o The shift count has a maximum value of 14 bits, so the maximum window size is 2^16 × 2^14 = 2^30 bytes = 1 gigabyte.

Triggering Transmission

It is up to TCP to decide when it has enough bytes to send a segment. Assuming the window is wide open, TCP has three mechanisms to trigger the transmission of a segment:
1. TCP maintains a variable called the maximum segment size (MSS) and sends a segment as soon as it has collected MSS bytes from the sending process.
   o MSS is usually set to the size of the largest segment TCP can send without causing the local IP to fragment, i.e.,
         MSS = MTU of directly connected network - TCP header size - IP header size
2. The sending process has explicitly asked TCP to send the data, using the push operation.
3. A timer fires.
   o The resulting segment contains as many bytes as are currently buffered for transmission.
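The numbers behind the wraparound, keeping-the-pipe-full, window scale and MSS points can be checked with a few lines of arithmetic. This is a rough sketch: the function names are hypothetical, the link speeds and the 100 ms RTT are example values rather than figures from this handout, and the MSS calculation assumes 20-byte IP and TCP headers without options.

```python
MSL_SECONDS = 120           # Maximum Segment Lifetime used in this handout
SEQ_SPACE_BYTES = 2 ** 32   # size of the 32-bit sequence number space


def seconds_until_wraparound(bandwidth_bps: float) -> float:
    """Time to send 2^32 bytes at the given link speed (bits per second)."""
    return SEQ_SPACE_BYTES / (bandwidth_bps / 8)


def required_window_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """RTT x bandwidth product: bytes in flight needed to keep the pipe full."""
    return bandwidth_bps / 8 * rtt_seconds


def scaled_window(advertised_window_field: int, shift: int) -> int:
    """Window in bytes after left-shifting the 16-bit field by the shift count."""
    if not 0 <= shift <= 14:
        raise ValueError("window scale shift count must be between 0 and 14")
    return advertised_window_field << shift


def mss(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """MSS = MTU - TCP header size - IP header size (no options assumed)."""
    return mtu - tcp_header - ip_header


slow_wrap = seconds_until_wraparound(10e6)     # 10 Mbps: roughly 57 minutes
fast_wrap = seconds_until_wraparound(1e9)      # 1 Gbps: roughly 34 seconds
pipe = required_window_bytes(100e6, 0.1)       # 100 Mbps, 100 ms RTT
largest_window = scaled_window(0xFFFF, 14)     # just under 2^30 bytes
ethernet_mss = mss(1500)                       # 1460 bytes
```

At 1 Gbps the sequence space wraps in less than the 120-second MSL, which is why the timestamp option is needed, and the example path needs about 1.25 MB in flight, far more than the 64 KB an unscaled window allows.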
Silly Window Syndrome (SWS)
• SWS is the transmission of small segments, caused either by the receiver advertising a small window or by the sender transmitting a small segment.
• SWS makes data transmission extremely inefficient.

Receive-side SWS avoidance: Clark's solution
• The receiver keeps the window closed until the buffer is half empty or the available buffer space is at least one MSS.

Send-side SWS avoidance: Nagle's algorithm
• If there is data to send but the window is open less than MSS, then wait some amount of time before sending the available data. But how long?
• Nagle introduced an elegant self-clocking solution:
  o As long as TCP has any data in flight, the sender will eventually receive an ACK.
  o This ACK can be treated like a timer firing, triggering the transmission of more data.
• Nagle's algorithm:

      When the application produces data to send
          if both the available data and the window ≥ MSS
              send a full segment
          else
              if there is unACKed data in flight
                  buffer the new data until an ACK arrives
              else
                  send all the new data now
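The decision in Nagle's algorithm above maps directly onto a small function. This is a sketch of the decision logic only: the function name, parameter names and returned strings are hypothetical, and a real TCP would also have to act on the decision (transmit, queue, and re-run the check when an ACK arrives).

```python
def nagle_decision(available_bytes: int, window_bytes: int,
                   mss: int, unacked_in_flight: bool) -> str:
    """Decide what to do when the application produces data to send."""
    if available_bytes >= mss and window_bytes >= mss:
        # Enough data and enough window for a full-sized segment.
        return "send_full_segment"
    if unacked_in_flight:
        # An ACK is on its way; it acts as the timer that releases the data.
        return "buffer_until_ack"
    # Nothing in flight, so no ACK will ever arrive to trigger sending.
    return "send_now"


full = nagle_decision(available_bytes=3000, window_bytes=8000,
                      mss=1460, unacked_in_flight=True)
waiting = nagle_decision(available_bytes=10, window_bytes=8000,
                         mss=1460, unacked_in_flight=True)
immediate = nagle_decision(available_bytes=10, window_bytes=8000,
                           mss=1460, unacked_in_flight=False)
```

An interactive application that writes one byte at a time therefore has at most one small segment in flight per round trip, while bulk transfers still go out as full segments.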
Adaptive Retransmission

Given the range of possible RTTs between any pair of hosts in the Internet, as well as the variation in RTT between the same two hosts over time, TCP uses an adaptive retransmission mechanism.
• The timeout value is set as a function of the estimated RTT between a pair of hosts.

Original algorithm
• Measure SampleRTT for each segment/ACK pair.
• Compute a weighted average between the previous estimate and the new sample:
      EstimatedRTT = α × EstimatedRTT + (1 - α) × SampleRTT    (α between 0.8 and 0.9)
      TimeOut = 2 × EstimatedRTT
• Problem: when a segment is retransmitted and then an ACK arrives at the sender, it is impossible to decide whether this ACK should be associated with the first or the second transmission for measuring the sample RTT.

Karn/Partridge Algorithm
• Do not measure SampleRTT when retransmitting.
• Double the timeout after each retransmission.
  o Motivation: the TCP source should not react too aggressively to a timeout, since congestion is the most likely cause of lost segments.

Jacobson/Karels Algorithm
• Takes the variance of the sample RTTs into account.
• If the variance among SampleRTTs is small:
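  o The EstimatedRTT can be better trusted.
  o There is no need to multiply it by 2 to compute the timeout.
• On the other hand, a large variance in SampleRTTs suggests that the timeout value should not be tightly coupled to the EstimatedRTT.
• Calculating the timeout (the standard form of the computation, where δ is a fraction between 0 and 1, and typically μ = 1 and φ = 4):
      Difference = SampleRTT - EstimatedRTT
      EstimatedRTT = EstimatedRTT + (δ × Difference)
      Deviation = Deviation + δ × (|Difference| - Deviation)
      TimeOut = μ × EstimatedRTT + φ × Deviation

TCP Extensions

There are extensions to TCP that are realized as options that can be added to the TCP header. Two hosts may agree to use the options during the TCP connection establishment phase.

RTT Measurement Option
• Used to accurately measure the RTT.
  o TCP reads the system clock when it is about to send a segment, and puts this time (a 32-bit timestamp) in the segment's header.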
  o The receiver echoes the timestamp back in its ACK.
  o The sender subtracts the timestamp from the current time to measure the RTT.

Protect against Wrapped Sequence Numbers Option
• Uses the 32-bit timestamp to effectively extend the sequence number space.

Window Scale Option
• Allows TCP to advertise a larger window.

Selective Acknowledgment (SACK) Option
• Allows TCP to augment its cumulative ACK with selective ACKs of any additional segments that have been received but aren't contiguous with all previously received segments.
• Without SACK, there are only two reasonable strategies for a sender:
  o The pessimistic strategy responds to a timeout by retransmitting not just the segment that timed out, but any segments transmitted subsequently.
  o The optimistic strategy responds to a timeout by retransmitting only the segment that timed out.
• With the SACK option, the sender can retransmit just the segments that fill the gaps between the segments that have been selectively ACKed.
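The three retransmission strategies can be compared on a small example. This is a sketch under simplifying assumptions: segments are represented as (start, end) byte ranges with the end exclusive, SACK blocks are assumed to line up with segment boundaries, and the function names are hypothetical rather than part of any TCP implementation.

```python
def pessimistic(sent, cumulative_ack):
    """Without SACK: resend the timed-out segment and everything after it."""
    return [seg for seg in sent if seg[1] > cumulative_ack]


def optimistic(sent, cumulative_ack):
    """Without SACK: resend only the first unacknowledged segment."""
    return pessimistic(sent, cumulative_ack)[:1]


def with_sack(sent, cumulative_ack, sack_blocks):
    """With SACK: resend only the gaps below the highest SACKed byte."""
    highest_sacked = max((hi for _, hi in sack_blocks), default=cumulative_ack)
    gaps = []
    for start, end in sent:
        if end <= cumulative_ack:
            continue  # covered by the cumulative ACK
        if start >= highest_sacked:
            continue  # beyond every SACK block: may simply still be in flight
        if any(lo <= start and end <= hi for lo, hi in sack_blocks):
            continue  # selectively acknowledged
        gaps.append((start, end))
    return gaps


# Five 1000-byte segments; the second and fourth were lost.
sent = [(0, 1000), (1000, 2000), (2000, 3000), (3000, 4000), (4000, 5000)]
cumulative_ack = 1000                      # next byte the receiver expects
sack_blocks = [(2000, 3000), (4000, 5000)]

resend_pessimistic = pessimistic(sent, cumulative_ack)
resend_optimistic = optimistic(sent, cumulative_ack)
resend_sack = with_sack(sent, cumulative_ack, sack_blocks)
```

Here the pessimistic sender resends four segments, two of them needlessly; the optimistic sender resends one and needs another timeout to discover the second loss; the SACK sender resends exactly the two missing segments.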