TCP/IP Networking
Part 4: Network and Transport Layer Protocols

Orientation
[Figure: protocol stacks along the path Host - Router - Router - Host. The application protocol and the TCP protocol run end-to-end between the Application and TCP layers of the two hosts. The IP protocol runs between neighboring IP layers on every hop. Each hop has its own Network Access / Data Link layer.]
[Figure: protocol stack. Transport Layer: TCP, UDP. Network Layer: ICMP, IP, IGMP. Link Layer: ARP, Network Access, RARP. Media.]

Processing of IP packets by network drivers
[Figure: flowchart of IP output and IP input.
- IP Output: if the IP destination is multicast or broadcast, the packet is put on the IP input queue.
- If the IP destination of the packet equals a local IP address, the loopback driver puts it on the IP input queue.
- Otherwise the MAC address is obtained with ARP and the packet is handed to the Ethernet driver.
- IP Input: the Ethernet driver demultiplexes each received Ethernet frame into an ARP packet or an IP datagram.]
Ethernet Encapsulation (RFC 894)
[Figure: frame layout, field sizes in bytes]
  802.3 MAC: destination address (6) | source address (6) | type (2) | data (46-1500) | CRC (4)
  type 0800 (2): IP datagram (38-1492)
  type 0806 (2): ARP request/reply (28) | PAD (10)
  type 8035 (2): RARP request/reply (28) | PAD (10)

IP Protocol and its Helpers
[Figure: protocol stack. Transport Layer: TCP, UDP. Network Layer: ICMP, IP, IGMP. Link Layer: ARP, Network Access, RARP. Media.]
IP Service
- IP provides an unreliable and connectionless service (datagram service).
  - Unreliable: IP does not guarantee that a transmitted packet will be delivered.
  - Connectionless: Each packet (datagram) is handled independently. IP is not aware that packets between hosts may be sent in a logical sequence.
- Consequences of an unreliable, connectionless service:
  - Lost packets
  - Packets are delivered out-of-sequence
  - Duplicate packets

IPv4 Datagram Format
- 20 bytes <= Header Size <= (2^4 - 1) 32-bit words = 60 bytes
- 20 bytes <= Total Length <= 2^16 - 1 bytes = 65535 bytes
Header layout (one row per 32-bit word, bits 0-31, at least five 32-bit words):
  word 1: version (4 bits) | header length (4 bits) | Type of Service/TOS (8 bits) | Total Length in bytes (16 bits)
  word 2: Identification (16 bits) | flags (3 bits) | Fragment Offset (13 bits)
  word 3: TTL Time-to-Live (8 bits) | Protocol (8 bits) | Header Checksum (16 bits)
  word 4: Source IP address (32 bits)
  word 5: Destination IP address (32 bits)
  Options (if any, <= 40 bytes)
  DATA
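The header layout above can be sketched as a small parser. This is a minimal illustration, not a full IP implementation: it reads only the fixed 20-byte part of the header, ignores options, and does not verify the checksum. The function name `parse_ipv4_header` and the dictionary keys are hypothetical names chosen for this example.

```python
import socket
import struct


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (options are ignored)."""
    if len(packet) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes long")
    # "!" = network byte order; field widths follow the slide's layout.
    (ver_ihl, tos, total_length, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    return {
        "version": ver_ihl >> 4,                 # upper 4 bits
        "header_bytes": (ver_ihl & 0x0F) * 4,    # header length counts 32-bit words
        "tos": tos,
        "total_length": total_length,
        "identification": ident,
        "flags": flags_frag >> 13,               # upper 3 bits
        "fragment_offset": flags_frag & 0x1FFF,  # lower 13 bits
        "ttl": ttl,
        "protocol": proto,                       # e.g. 1 = ICMP, 6 = TCP, 17 = UDP
        "checksum": checksum,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }
```

The 4-bit header length field counts 32-bit words, which is why the largest possible header is 15 * 4 = 60 bytes.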
Orientation
- The IP (Internet Protocol) relies on several other protocols to perform necessary control and routing functions.
  - Routing: EGP, RIP, BGP, OSPF
  - Control: ICMP, IGMP

ICMP
- The Internet Control Message Protocol (ICMP) is the protocol used for error and control messages in the Internet
- ICMP provides a mechanism for routers to report errors back to the sources
- All ICMP packets are encapsulated as IP datagrams
- The packet format is simple (one row per 32-bit word, bits 0-31):
  word 1: Type (8 bits) | Code (8 bits) | Checksum (16 bits)
  (additional information dependent on Type and Code)
ICMP Message Types
- ICMP messages are either query messages or error messages
- ICMP query messages:
  - Echo request / Echo reply
  - Router advertisement / Router solicitation
  - Timestamp request / Timestamp reply
  - Address mask request / Address mask reply
- ICMP error messages:
  - Host unreachable
  - Source quench
  - Time Exceeded
  - Parameter Problem

Echo Request and Reply
- PING (= Packet InterNet Groper) is a program that uses the ICMP echo request and echo reply messages.
- PINGs are handled directly by the kernel
- Each Ping is translated into an ICMP Echo Request
- The Pinged host responds with an ICMP Echo Reply
[Figure: host viper sends an ICMP ECHO REQUEST to host mamba; mamba returns an ICMP ECHO REPLY to viper.]
Format of Echo Request and Reply
  word 1: Type (= 8 for request, 0 for reply) | Code (= 0) | Checksum
  word 2: identifier | sequence number
  optional data
- Identifier is set to the process ID of the querying process
- Sequence number is incremented for each new echo request

Transport Protocols
[Figure: protocol stack. Application Layer: user processes. Transport Layer: TCP, UDP. Network Layer: ICMP, IP, IGMP. Link Layer: ARP, Hardware Interface, RARP. Media.]
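The Echo Request format described above can be sketched in a few lines. The example below builds the ICMP message bytes only; it does not open a raw socket or send anything. The function names are hypothetical, and the checksum routine is the standard 16-bit one's-complement Internet checksum, computed with the checksum field set to zero.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of the data, complemented."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """Return an ICMP Echo Request (Type 8, Code 0) with a valid checksum."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)  # checksum = 0 for now
    csum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, identifier, sequence) + payload
```

A receiver can validate a message by running the same checksum over the whole message: a correct message sums to zero.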
Orientation
- Transport layer protocols are end-to-end protocols
- They are only implemented at the hosts
[Figure: two HOSTs, each with Application, Transport, Network and Data Link layers, connected through an intermediate node that has only Network and Data Link layers.]

Transport Protocols in the Internet
The Internet supports two transport protocols:
- UDP - User Datagram Protocol
  - datagram oriented
  - unreliable, connectionless
  - simple
  - unicast and multicast
  - useful only for few applications, e.g., multimedia applications
  - used a lot for services: network management (SNMP), routing (RIP), naming (DNS), etc.
- TCP - Transmission Control Protocol
  - stream oriented
  - reliable, connection-oriented
  - complex
  - only unicast
  - used for most Internet applications: web (http), email (smtp), file transfer (ftp), terminal (telnet), etc.
UDP - User Datagram Protocol
[Figure: protocol stack. Transport Layer: TCP, UDP. Network Layer: ICMP, IP, IGMP. Link Layer: ARP, Network Access, RARP. Media.]

UDP - User Datagram Protocol
- UDP supports unreliable transmission of datagrams
- UDP merely extends the host-to-host delivery service of IP datagrams to an application-to-application service
- The only thing that UDP adds is multiplexing and demultiplexing
[Figure: Applications on top of UDP on each end host; IP on the hosts and on the intermediate nodes between them.]
UDP Format
  IP header (20 bytes) | UDP header (8 bytes) | UDP data
UDP header layout (one row per 32-bit word, bits 0-15 and 16-31):
  word 1: Source Port Number | Destination Port Number
  word 2: UDP message length | Checksum
  DATA

Port Numbers
- UDP (and TCP) use port numbers to identify applications
- A globally unique address at the transport layer (for both UDP and TCP) is a tuple <IP address, port number>
- Port numbers are 16 bits long, so there are 65,535 usable UDP ports per host (port 0 is reserved).
[Figure: several user processes sit above TCP and UDP. TCP and UDP demultiplex based on port number; IP demultiplexes based on the Protocol field in the IP header.]
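The 8-byte UDP header and the port-based demultiplexing step can be sketched as follows. This is an illustration under simplifying assumptions: the checksum is left at zero (over IPv4 a zero checksum means "not computed"), and the `handlers` table mapping destination ports to callables is a hypothetical stand-in for the operating system's socket table.

```python
import struct


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Prepend an 8-byte UDP header to the payload."""
    length = 8 + len(payload)   # UDP message length covers header and data
    checksum = 0                # assumption: checksum not computed in this sketch
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload


def demultiplex(datagram: bytes, handlers: dict):
    """Hand the payload to the handler registered for the destination port."""
    src_port, dst_port, length, _checksum = struct.unpack("!HHHH", datagram[:8])
    payload = datagram[8:length]
    handler = handlers.get(dst_port)
    if handler is None:
        return None             # no application is bound to this port
    return handler(src_port, payload)
```

For example, with `handlers = {53: lambda src, data: ("dns", src, data)}`, a datagram built with destination port 53 reaches the DNS handler, while a datagram for an unregistered port is dropped.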
TCP - Transmission Control Protocol
[Figure: protocol stack. Transport Layer: TCP, UDP. Network Layer: ICMP, IP, IGMP. Link Layer: ARP, Network Access, RARP. Media.]

TCP Topics
- Connection-oriented reliable byte stream
- TCP Connection Management
  - 3-way handshake
- TCP Flow Control
  - Sliding Window Flow Control
- TCP Congestion control
  - To be discussed in more detail later
- TCP Error control
  - ARQ
  - Retransmission schemes
What is Flow/Congestion/Error Control?
- Flow Control: Algorithms that prevent the sender from overrunning the receiver with information
- Congestion Control: Algorithms that prevent the sender from overloading the network
- Error Control: Algorithms to recover from or conceal the effects of packet losses

Connection-oriented reliable byte stream
Overview
- TCP = Transmission Control Protocol
- Connection-oriented protocol
- Provides a reliable unicast end-to-end byte stream over an unreliable internetwork.
[Figure: a byte stream enters TCP on one host and leaves TCP on the other host; the two TCP entities communicate over the IP internetwork.]

Connection-oriented reliable byte stream
- Before any data transfer, TCP establishes a connection:
  - One TCP entity is waiting for a connection (server)
  - The other TCP entity (client) contacts the server
- The actual procedure for setting up connections is more complex.
- Each connection is full duplex
[Figure: CLIENT / SERVER timeline. The server is waiting for a connection request; the client requests a connection; the server accepts the connection; Data Transfer; Disconnect.]
Connection-oriented reliable byte stream
- The byte stream is broken up into chunks which are called segments
  - Receiver sends acknowledgements (ACKs) for segments
  - TCP maintains a timer. If an ACK is not received in time, the segment is retransmitted
- Detecting errors:
  - TCP has checksums for header and data. Segments with invalid checksums are discarded
  - Each byte that is transmitted has a sequence number

Connection-oriented reliable byte stream
- To the lower layers, TCP handles data in blocks, the segments.
- To the higher layers, TCP handles data as a sequence of bytes and does not preserve the boundaries between the blocks of data written by the application
[Figure: the sending application writes 100 bytes, then 20 bytes, into the TCP queue of bytes to be transmitted. The bytes travel as segments. The receiving application reads 40 bytes three times from the TCP queue of bytes that have been received.]
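The write/read example above can be mimicked with a toy in-memory queue. This is only a sketch of the byte-stream abstraction seen by the application; the class name `ByteStream` is hypothetical, and segmentation, ACKs and retransmission are deliberately left out.

```python
class ByteStream:
    """Toy model of the queue between a writer and a reader: bytes only, no boundaries."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def write(self, data: bytes) -> None:
        # The sizes of individual writes are not recorded anywhere.
        self._buffer.extend(data)

    def read(self, n: int) -> bytes:
        # Return up to n bytes from the front of the queue.
        chunk = bytes(self._buffer[:n])
        del self._buffer[:n]
        return chunk


stream = ByteStream()
stream.write(b"a" * 100)   # 1. write 100 bytes
stream.write(b"b" * 20)    # 2. write 20 bytes
read_sizes = [len(stream.read(40)) for _ in range(3)]   # three reads of 40 bytes
```

The third read returns bytes from both writes, which is exactly why an application protocol on top of TCP has to define its own message framing.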
TCP Format
- TCP segments have a 20 byte header with >= 0 bytes of data.
  IP header (20 bytes) | TCP header (20 bytes) | TCP data
TCP header layout (one row per 32-bit word, bits 0-15 and 16-31):
  word 1: Source Port Number | Destination Port Number
  word 2: Sequence number (32 bits)
  word 3: Acknowledgement number (32 bits)
  word 4: header length | 0 | Flags | window size
  word 5: TCP checksum | urgent pointer
  Options (if any)
  DATA

TCP Connection Management
- Opening a TCP Connection
- Closing a TCP Connection
- State Diagram
TCP States in Normal Connection Lifetime
[Figure: message sequence between the active side (left) and the passive side (right).
  Active side states: SYN_SENT (active open), ESTABLISHED, FIN_WAIT_1 (active close), FIN_WAIT_2, TIME_WAIT.
  Passive side states: LISTEN (passive open), SYN_RCVD, ESTABLISHED, CLOSE_WAIT (passive close), LAST_ACK, CLOSED.
  Opening: SYN (SeqNo = x); SYN (SeqNo = y, AckNo = x + 1); (AckNo = y + 1).
  Closing: FIN (SeqNo = m); (AckNo = m + 1); FIN (SeqNo = n); (AckNo = n + 1).]

Three-Way Handshake
  aida.poly.edu -> mng.poly.edu: S 1031880193:1031880193(0) win 16384 <mss 1460,...>
  mng.poly.edu -> aida.poly.edu: S 172488586:172488586(0) ack 1031880194 win 8760 <mss 1460>
  aida.poly.edu -> mng.poly.edu: ack 172488587 win 17520
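The sequence and acknowledgement numbers in the handshake follow a simple rule: a SYN consumes one sequence number, so each side acknowledges the other side's initial sequence number plus one. The sketch below only computes these numbers; the function name and the dictionary layout are hypothetical, and no packets are sent.

```python
SEQ_MODULUS = 2 ** 32   # TCP sequence numbers are 32 bits wide and wrap around


def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Return the three handshake segments as dictionaries."""
    syn = {"flags": "SYN", "seq": client_isn}
    syn_ack = {
        "flags": "SYN,ACK",
        "seq": server_isn,
        "ack": (client_isn + 1) % SEQ_MODULUS,   # the SYN consumed one sequence number
    }
    ack = {
        "flags": "ACK",
        "seq": (client_isn + 1) % SEQ_MODULUS,
        "ack": (server_isn + 1) % SEQ_MODULUS,
    }
    return [syn, syn_ack, ack]


# Initial sequence numbers taken from the aida/mng trace above.
segments = three_way_handshake(1031880193, 172488586)
```

With the initial sequence numbers from the trace, the computed acknowledgement numbers are 1031880194 and 172488587, matching the second and third lines of the trace.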
Why is a Two-Way Handshake not enough?
  aida.poly.edu -> mng.poly.edu: S 1031880193:1031880193(0) win 16384 <mss 1460,...> (delayed duplicate packet, the red line in the figure)
  aida.poly.edu -> mng.poly.edu: S 15322112354:15322112354(0) win 16384 <mss 1460,...> (will be discarded as a duplicate SYN)
  mng.poly.edu -> aida.poly.edu: S 172488586:172488586(0) win 8760 <mss 1460>
- When aida initiates the data transfer (starting with SeqNo=15322112355), mng will reject all data.

TCP Connection Termination
  mng.poly.edu -> aida.poly.edu: F 172488734:172488734(0) ack 1031880221 win 8733
  aida.poly.edu -> mng.poly.edu: . ack 172488735 win 17484
  aida.poly.edu -> mng.poly.edu: F 1031880221:1031880221(0) ack 172488735 win 17520
  mng.poly.edu -> aida.poly.edu: . ack 1031880222 win 8733
TCP States
  State         Description
  CLOSED        No connection is active or pending
  LISTEN        The server is waiting for an incoming call
  SYN RCVD      A connection request has arrived; wait for Ack
  SYN SENT      The client has started to open a connection
  ESTABLISHED   Normal data transfer state
  FIN WAIT 1    Client has said it is finished
  FIN WAIT 2    Server has agreed to release
  TIME WAIT     Wait for pending packets (2MSL wait state)
  CLOSING       Both sides have tried to close simultaneously
  CLOSE WAIT    Server has initiated a release
  LAST ACK      Wait for pending packets

TCP State Transition Diagram: Opening A Connection
[Figure: state diagram. Transitions, written as event / action:
  CLOSED -> LISTEN: passive open / send nothing
  CLOSED -> SYN SENT: active open / send SYN
  LISTEN -> SYN RCVD: recv SYN / send SYN, ACK
  LISTEN -> SYN SENT: application sends data / send SYN
  SYN RCVD -> LISTEN: recv RST
  SYN RCVD -> ESTABLISHED: recv ACK / send nothing
  SYN RCVD -> closing states: send FIN
  SYN SENT -> SYN RCVD: simultaneous open, recv SYN / send SYN, ACK
  SYN SENT -> ESTABLISHED: recv SYN, ACK / send ACK
  SYN SENT -> CLOSED: close or timeout
  ESTABLISHED -> closing states: send FIN, or recv FIN]
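TCP State Transition Diagram: Closing A Connection
[Figure: state diagram. Transitions, written as event / action:
  ESTABLISHED -> FIN_WAIT_1: active close / send FIN
  ESTABLISHED -> CLOSE_WAIT: passive close, recv FIN / send ACK
  FIN_WAIT_1 -> FIN_WAIT_2: recv ACK / send nothing
  FIN_WAIT_1 -> CLOSING: recv FIN / send ACK
  FIN_WAIT_1 -> TIME_WAIT: recv FIN, ACK / send ACK
  FIN_WAIT_2 -> TIME_WAIT: recv FIN / send ACK
  CLOSING -> TIME_WAIT: recv ACK / send nothing
  CLOSE_WAIT -> LAST_ACK: application closes / send FIN
  LAST_ACK -> CLOSED: recv ACK / send nothing
  TIME_WAIT -> CLOSED: timeout (2 MSL)]

TCP Flow Control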
TCP Flow Control
- TCP implements sliding window flow control
- Sending acknowledgements is separated from setting the window size at the sender.
- Acknowledgements do not automatically increase the window size
- Acknowledgements are cumulative

Sliding Window Flow Control
- The Sliding Window Protocol is performed at the byte level:
[Figure: bytes 1 to 11 divided into four regions: sent and acknowledged; sent but not acknowledged; can be sent (USABLE WINDOW); can't be sent. The advertised window spans the second and third regions.]
- Here: Sender can transmit sequence numbers 6, 7, 8.
Sliding Window: Window Closes
- Transmission of a single byte (with SeqNo = 6) and an acknowledgement is received (AckNo = 5, Win = 4):
[Figure: bytes 1 to 11 shown in three steps: initial window; after "Transmit Byte 6"; after "AckNo = 5, Win = 4 is received".]

Sliding Window: Window Opens
- An acknowledgement is received that enlarges the window to the right (AckNo = 5, Win = 6):
[Figure: bytes 1 to 11 shown before and after "AckNo = 5, Win = 6 is received".]
- A receiver opens a window when the TCP buffer empties (meaning that data is delivered to the application).
Sliding Window: Window Shrinks
- An acknowledgement is received that reduces the window from the right (AckNo = 5, Win = 3):
[Figure: bytes 1 to 11 shown before and after "AckNo = 5, Win = 3 is received".]
- Shrinking a window should not be used

Window Management in TCP
- The receiver is returning two parameters to the sender:
  AckNo (32 bits) | window size (win) (16 bits)
- The interpretation is: I am ready to receive new data with SeqNo = AckNo, AckNo+1, ..., AckNo+Win-1
- Receiver can acknowledge data without opening the window
- Receiver can change the window size without acknowledging data
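The interpretation of (AckNo, Win) can be written down as a small calculation on the sender side. This is a sketch with hypothetical function names; it assumes the usual TCP convention that AckNo is the next byte the receiver expects, and it ignores sequence-number wraparound and the congestion window.

```python
def last_acceptable_seq(ack_no: int, win: int) -> int:
    """Highest sequence number the receiver is willing to accept."""
    return ack_no + win - 1


def usable_window(ack_no: int, win: int, next_seq: int) -> int:
    """Number of bytes the sender may still transmit.

    next_seq is the sequence number of the first byte not yet sent.
    """
    right_edge = ack_no + win           # first sequence number outside the window
    first_unsent = max(next_seq, ack_no)
    return max(right_edge - first_unsent, 0)
```

For illustration, `usable_window(2048, 2048, 2048)` allows 2048 more bytes, while `usable_window(4096, 0, 4096)` is zero, so the sender is blocked until the receiver advertises a larger window.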
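Sliding Window: Example
[Figure: sender and receiver timeline; the receiver buffer holds 0 to 4K.
  1. Sender sends 2K of data (SeqNo=0). Receiver buffer holds 2K. Receiver returns AckNo=2048, Win=2048.
  2. Sender sends 2K of data (SeqNo=2048). Receiver buffer holds 4K. Receiver returns AckNo=4096, Win=0. Sender blocked.
  3. Receiver buffer drops to 3K. Receiver returns AckNo=4096, Win=1024.]

TCP Congestion Control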
TCP Congestion Control
- TCP has a mechanism for congestion control. The mechanism is implemented at the sender
- The window size at the sender is set as follows:
    Send Window = MIN (flow control window, congestion window)
  where
  - flow control window is advertised by the receiver
  - congestion window is adjusted based on feedback from the network

TCP Congestion Control
- The sender has two additional parameters:
  - Congestion Window (cwnd): initial value is 1 MSS (= maximum segment size), counted as bytes
  - Slow-start threshold value (ssthresh): initial value is the advertised window size
- Congestion control works in two modes:
  - slow start (cwnd < ssthresh)
  - congestion avoidance (cwnd >= ssthresh)
Slow Start
- Initial value: cwnd = MSS bytes (= 1 segment)
- Each time an ACK is received, the congestion window is increased by MSS bytes:
    cwnd = cwnd + MSS bytes
- If an ACK acknowledges two segments, cwnd is still increased by only MSS bytes (= 1 segment).
- Even if an ACK acknowledges a segment that is smaller than MSS bytes long, cwnd is increased by MSS bytes.
- Does Slow Start increment slowly? Not really. In fact, the increase of cwnd can be exponential

Slow Start Example
- The congestion window size grows very rapidly
  - For every ACK, we increase cwnd by 1 irrespective of the number of segments ACKed
- TCP slows down the increase of cwnd when cwnd > ssthresh
[Figure: timeline.
  cwnd = 1xMSS: segment 1 is sent; ACK for segment 1 arrives.
  cwnd = 2xMSS: segments 2 and 3 are sent; ACK for segment 2 and ACK for segment 3 arrive.
  cwnd = 4xMSS: segments 4, 5 and 6 are sent; ACK for segment 4, ACK for segment 5 and ACK for segment 6 arrive.
  cwnd = 7xMSS.]
Congestion Avoidance
- The congestion avoidance phase is started when cwnd has reached the slow-start threshold value
- If cwnd >= ssthresh then each time an ACK is received, increment cwnd as follows:
    cwnd = cwnd + MSS * MSS / cwnd + segsize / 8
- So cwnd is increased by one segment (= MSS bytes) only if all segments have been acknowledged.

Slow Start / Congestion Avoidance
Here we give a more accurate version than in our earlier discussion of Slow Start:
    If cwnd <= ssthresh then
        Each time an ACK is received: cwnd = cwnd + MSS
    else /* cwnd > ssthresh */
        Each time an ACK is received: cwnd = cwnd + MSS * MSS / cwnd + segsize / 8
    endif
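The per-ACK update can be simulated over a few round trips to see the two growth regimes. The sketch below makes several simplifying assumptions that are not in the slides: every segment in the window is ACKed individually once per round trip, there is no loss, the `segsize / 8` term is omitted, and slow start is taken to end as soon as cwnd reaches ssthresh (`cwnd < ssthresh`). The function name and the default MSS of 1460 bytes are illustrative choices.

```python
def simulate_cwnd(ssthresh_segments: int, rounds: int, mss: int = 1460) -> list:
    """Return cwnd (in segments) at the start of each round trip."""
    cwnd = float(mss)                       # initial value: 1 MSS, counted in bytes
    ssthresh = ssthresh_segments * mss
    trace = []
    for _ in range(rounds):
        trace.append(cwnd / mss)
        acks = int(cwnd // mss)             # one ACK per full segment in flight
        for _ in range(acks):
            if cwnd < ssthresh:
                cwnd += mss                 # slow start: one MSS per ACK
            else:
                cwnd += mss * mss / cwnd    # congestion avoidance: about one MSS per round trip
    return trace
```

With `simulate_cwnd(8, 6)`, cwnd doubles each round trip (1, 2, 4, 8 segments) until it reaches the threshold and then grows by a little less than one segment per round trip, because each ACK adds slightly less as cwnd increases within the round.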
Example of Slow Start/Congestion Avoidance
- Assume that ssthresh = 8
[Figure: plot of cwnd (in segments, axis 0 to 14) over roundtrip times (t=0 to t=6), with a horizontal line at ssthresh. Values per round trip: cwnd = 1, 2, 4, 8, 9, 10.]

Responses to Congestion
- Most often, a packet loss in a network is due to an overflow at a congested router (rather than due to a transmission error)
- So, TCP assumes there is congestion if it detects a packet loss
- A TCP sender can detect lost packets via:
  - Timeout of a retransmission timer
  - Receipt of a duplicate ACK
- When a packet loss is detected, TCP assumes that it is caused by congestion and reduces the size of the sending window
TCP Tahoe
- Congestion is assumed if the sender has a timeout or receives a duplicate ACK
- Each time congestion occurs:
  - ssthresh is set to half the current size of the congestion window:
      ssthresh = cwnd / 2
  - cwnd is then reset to one:
      cwnd = 1
  - and slow-start is entered

Slow Start / Congestion Avoidance
- A typical plot of cwnd for a TCP connection (MSS = 1500 bytes) with TCP Tahoe:
[Figure: plot of cwnd over time, not recoverable from the extracted text.]
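The Tahoe reaction produces the familiar sawtooth shape of cwnd. The following sketch models it per round trip in units of segments, which is a coarser view than the per-ACK rules: cwnd doubles during slow start (capped at ssthresh) and grows by one segment per round trip in congestion avoidance. The function name, the set of `loss_rounds`, and the lower bound of 1 segment on ssthresh are assumptions made for this illustration.

```python
def tahoe_trace(rounds: int, loss_rounds: set, ssthresh: int = 8) -> list:
    """Return cwnd (in segments) at the start of each round trip."""
    cwnd = 1
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:
            # Tahoe: remember half the window, then restart from one segment.
            ssthresh = max(cwnd // 2, 1)   # assumption: never let ssthresh drop to 0
            cwnd = 1
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start
        else:
            cwnd += 1                       # congestion avoidance
    return trace
```

For example, `tahoe_trace(10, {5})` grows to 10 segments, detects a loss, sets ssthresh to 5, and restarts slow start from 1 segment.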
TCP Error Control
- Background on Error Control
- TCP Error Control

Background: ARQ Error Control
- Two types of errors:
  - Lost packets
  - Damaged packets
- Most Error Control techniques are based on:
  1. Error Detection Scheme (Parity checks, CRC).
  2. Retransmission Scheme.
- Error control schemes that involve error detection and retransmission of lost or corrupted packets are referred to as Automatic Repeat Request (ARQ) error control.
Background: ARQ Error Control
- All retransmission schemes use all or a subset of the following procedures:
  - Positive acknowledgments (ACK)
  - Negative acknowledgment (NACK)
- All retransmission schemes (using ACK, NACK or both) rely on the use of timers
- The most common ARQ retransmission schemes are:
  - Stop-and-Wait ARQ
  - Go-Back-N ARQ
  - Selective Repeat ARQ
- The protocols for sending ACKs in all ARQ schemes are based on the sliding window flow control scheme
Background: Stop-and-Wait ARQ
- Stop-and-Wait ARQ is an addition to the Stop-and-Wait flow control protocol:
  - Packets have 1-bit sequence numbers (SN = 0 or 1)
  - Receiver sends an ACK (1-SN) if packet SN is correctly received
  - Sender waits for an ACK (1-SN) before transmitting the next packet with sequence number 1-SN
  - If the sender does not receive anything before a timeout value expires, it retransmits packet SN

Background: Stop-and-Wait ARQ
[Figure: timeline between A and B with a lost packet. A sends Packet 1, Packet 0 and Packet 1; one packet is lost, A's timer expires (Timeout) and the packet is sent again. B returns ACK 0, ACK 1, ACK 0.]
Background: Go-Back-N ARQ
- Operations:
  - A station may send multiple packets as allowed by the window size
  - Receiver sends a NAK i if packet i is in error. After that, the receiver discards all incoming packets until the packet in error has been correctly retransmitted
  - If the sender receives a NAK i it will retransmit packet i and all packets i+1, i+2, ... which have been sent but not been acknowledged

Example of Go-Back-N ARQ
[Figure: three snapshots of sender A (frames waiting for ACK/NAK) and receiver B (frames received).
  1. A holds frames 1, 2, 3. Frame 1 is correct; B sends ACK 2.
  2. A holds frames 2, 3, 4. Frame 2 is in error; B sends NAK 2.
  3. A holds frames 2, 3, 4 and retransmits frames 2, 3, 4.]
- In Go-Back-N, if frames are correctly delivered, they are delivered in the correct sequence
- Therefore, the receiver does not need to keep track of "holes" in the sequence of delivered frames
Background: Go-Back-N ARQ
[Figure: timeline between A and B with a lost packet. A sends Packets 0 to 6; Packet 4 is lost. B returns ACK 3 and NAK 4, and Packets 5 and 6 are discarded. Packets 4, 5, 6 are retransmitted and B returns ACK 6.]

Background: Selective-Repeat ARQ
- Similar to Go-Back-N ARQ. However, the sender only retransmits packets for which a NAK is received
- Advantage over Go-Back-N:
  - Fewer retransmissions.
- Disadvantages:
  - More complexity at sender and receiver
  - Each packet must be acknowledged individually (no cumulative acknowledgements)
  - Receiver may receive packets out of sequence
Example of Selective-Repeat ARQ
[Figure: three snapshots of sender A (frames waiting for ACK/NAK) and receiver B (frames received).
  1. A holds frames 1, 2, 3. Frame 1 is correct; B sends ACK 2.
  2. A holds frames 2, 3, 4. Frame 2 is in error; B sends NAK 2.
  3. A holds frames 2, 3, 4, 5 and retransmits only frame 2; B has received frames 1, 3, 4.]
- Receiver must keep track of "holes" in the sequence of delivered frames
- Sender must maintain one timer per outstanding packet

Background: Selective-Repeat ARQ
[Figure: timeline between A and B with a lost packet. A sends Packets 0 to 6; Packet 4 is lost. B returns ACK 1, ACK 2, ACK 3, then ACK 6 together with NAK 4; Packets 5 and 6 are buffered. Only Packet 4 is retransmitted. A continues with Packet 7 and Packet 0; B returns further ACKs.]
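The difference between the two receivers can be sketched in a few lines. Both functions below take the sequence numbers in the order in which packets arrive and return what is delivered to the application. This is a simplified model with hypothetical function names: sequence numbers do not wrap around, the receive window is unbounded, and ACK/NAK generation is left out.

```python
def go_back_n_receiver(arrivals: list) -> tuple:
    """Accept packets only in order; everything else is discarded."""
    expected = 0
    delivered, discarded = [], []
    for seq in arrivals:
        if seq == expected:
            delivered.append(seq)
            expected += 1
        else:
            discarded.append(seq)      # out of order: dropped, must be resent
    return delivered, discarded


def selective_repeat_receiver(arrivals: list) -> list:
    """Buffer out-of-order packets and deliver them once the hole is filled."""
    expected = 0
    delivered = []
    buffered = set()
    for seq in arrivals:
        if seq < expected or seq in buffered:
            continue                   # duplicate, ignore
        buffered.add(seq)
        while expected in buffered:    # deliver the longest in-order run
            buffered.remove(expected)
            delivered.append(expected)
            expected += 1
    return delivered
```

With packet 4 lost on its first transmission, a Go-Back-N receiver sees arrivals `[0, 1, 2, 3, 5, 6, 4, 5, 6]` and discards the first copies of 5 and 6, whereas a Selective-Repeat receiver needs only `[0, 1, 2, 3, 5, 6, 4]` to deliver all seven packets in order.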
Error Control in TCP
- TCP implements a variation of the Go-Back-N retransmission scheme
- TCP maintains a Retransmission Timer for each connection:
  - The timer is started during a transmission. A timeout causes a retransmission
- TCP couples error control and congestion control (i.e., it assumes that errors are caused by congestion)
- TCP allows accelerated retransmissions (Fast Retransmit)

TCP Retransmission Timer
- Retransmission Timer:
  - The setting of the retransmission timer is crucial for efficiency
  - A timeout value that is too small results in unnecessary retransmissions
  - A timeout value that is too large results in a long waiting time before a retransmission can be issued
- A problem is that the delays in the network are not fixed
- Therefore, the retransmission timers must be adaptive
Round-Trip Time Measurements
- The retransmission mechanism of TCP is adaptive
- The retransmission timers are set based on round-trip time (RTT) measurements that TCP performs
- The RTT is based on the time difference between segment transmission and ACK
- But:
  - TCP does not ACK each segment
  - Each connection has only one timer
[Figure: timeline with Segments 1 to 5 and the ACKs "ACK for Segment 1", "ACK for Segment 2 + 3", "ACK for Segment 4", "ACK for Segment 5"; three intervals are marked RTT #1, RTT #2, RTT #3.]

Round-Trip Time Measurements
- The retransmission timer is set to a Retransmission Timeout (RTO) value.
- RTO is calculated based on the RTT measurements.
- The RTT measurements are smoothed by the following estimators srtt and rttvar:
    srtt_{n+1} = α RTT + (1 - α) srtt_n
    rttvar_{n+1} = β |RTT - srtt_{n+1}| + (1 - β) rttvar_n
    RTO_{n+1} = srtt_{n+1} + 4 rttvar_{n+1}
- The gains are set to α = 1/4 and β = 1/8
- srtt_0 = 0 sec, rttvar_0 = 3 sec. Also: RTO_1 = srtt_1 + 2 rttvar_1
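The estimators above translate directly into a small update function. The sketch follows the formulas and gains as given in these slides (α = 1/4, β = 1/8); note that RFC 6298 recommends different gains (1/8 for srtt and 1/4 for rttvar), so the gains are parameters here. The function name `update_rto` is hypothetical.

```python
def update_rto(srtt: float, rttvar: float, rtt: float,
               alpha: float = 0.25, beta: float = 0.125) -> tuple:
    """Apply one RTT measurement and return (srtt, rttvar, rto), all in seconds."""
    srtt = alpha * rtt + (1 - alpha) * srtt
    # The deviation is measured against the updated srtt, as in the formulas above.
    rttvar = beta * abs(rtt - srtt) + (1 - beta) * rttvar
    rto = srtt + 4 * rttvar
    return srtt, rttvar, rto


# Start from the initial values srtt_0 = 0 sec and rttvar_0 = 3 sec
# and feed in one hypothetical RTT measurement of 1 second.
srtt, rttvar, rto = update_rto(0.0, 3.0, 1.0)
```

Because rttvar starts large and decays slowly, the first computed RTO values stay conservative until several measurements have been taken.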
Karn's Algorithm
- If an ACK for a retransmitted segment is received, the sender cannot tell if the ACK belongs to the original or the retransmission.
[Figure: timeline with a segment, a Timeout, the retransmission of the segment and one ACK; two candidate intervals are each marked "RTT?".]
- Karn's Algorithm: Don't update srtt on any segments that have been retransmitted.
- Each time TCP retransmits, it sets:
    RTO_{n+1} = min(2 RTO_n, 64) (exponential backoff)

Measuring TCP Retransmission Timers
[Figure: ftp session from ellington.cs.virginia.edu to satchmo.cs.virginia.edu.]
- Transfer file from ellington to satchmo
- Unplug Ethernet cable in the middle of the file transfer
Interpreting the Measurements
- The interval between retransmission attempts in seconds is: 1.03, 3, 6, 12, 24, 48, 64, 64, 64, 64, 64, 64, 64.
- Time between retransmissions is doubled each time (Exponential Backoff Algorithm)
- Timer is not increased beyond 64 seconds
- TCP gives up after the 13th attempt and 9 minutes.
[Figure: plot of elapsed time in seconds (0 to 600) over transmission attempts (0 to 12).]

RTO Calculation: Example
- At t1: RTO = srtt + 2 rttvar = 6 sec
- At t2: RTO = 2 * (srtt + 4 rttvar) = 24 sec (exponential backoff)
- At t4: RTO is not updated (due to Karn's algorithm)
[Figure: timeline with time marks t1 to t9: SYN, SYN (retransmitted after a Timeout), SYN + ACK, ACK, Segment 1, ACK for Segment 1, Segment 2, Segment 3, ACK for Segment 2, ACK for Segment 3, Segment 4, Segment 5, Segment 6, ACK for Segment 4; three intervals are marked RTT #1, RTT #2, RTT #3.]
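The measured retransmission intervals can be reproduced with a short backoff calculation. This is a sketch, not the kernel's timer code: the function name is hypothetical, the cap of 64 seconds comes from the measurements above, and the first interval (1.03 s) is treated as a given measurement rather than computed.

```python
def backoff_schedule(first_rto: float, attempts: int, cap: float = 64.0) -> list:
    """Return the successive timeout values under exponential backoff."""
    rto = first_rto
    schedule = []
    for _ in range(attempts):
        schedule.append(rto)
        rto = min(2 * rto, cap)   # double the timeout, but never exceed the cap
    return schedule


# Intervals after the first measured one (1.03 s), starting at 3 seconds.
intervals = [1.03] + backoff_schedule(3.0, 12)
total_seconds = sum(intervals)
```

The thirteen intervals add up to roughly 542 seconds, which is consistent with the observation that TCP gives up after about 9 minutes.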