EE 122: Transport Protocols Kevin Lai October 16, 2002
Motivation
IP provides a weak but efficient service model (best-effort)
- packets can be delayed, dropped, reordered, or duplicated
- packets have limited size (why?)
IP packets are addressed to a host
- how to decide which application gets which packets?
How should hosts send into the network?
- if everyone sends as fast as they can, many packets are dropped and the network is under-utilized (congestion collapse)
laik@cs.berkeley.edu
Transport Protocol
Provides more than the underlying network protocol
- more reliability, in-order delivery, at-most-once delivery
- supports messages of arbitrary length
- provides a way to decide which packets go to which applications (multiplexing/demultiplexing)
- governs how hosts should send data to prevent congestion collapse (congestion control and avoidance)
[Figure: protocol stack, top to bottom: Transport Layer (TCP/UDP), Networking Layer (IP), Link Layer, Physical Layer]
UDP: User Datagram Protocol
minimalistic transport protocol
same best-effort service model as IP
messages can be larger than one packet, but still limited (64KB)
- uses fragmentation
provides multiplexing/demultiplexing to IP
does not provide congestion control
advantage over TCP: does not increase end-to-end delay over IP
application example: video/audio streaming
TCP: Transmission Control Protocol
reliable, in-order, and at-most-once delivery
messages can be of arbitrary length
provides multiplexing/demultiplexing to IP
provides congestion control and avoidance
increases end-to-end delay over IP
application examples: file transfer, chat
Headers
IP header: used for IP routing, fragmentation, error detection
UDP header: used for multiplexing/demultiplexing, error detection
TCP header: used for multiplexing/demultiplexing, flow and congestion control
[Figure: encapsulation. At the sender, application data passes down through TCP/UDP and IP, gaining a TCP/UDP header and then an IP header; at the receiver, IP and TCP/UDP strip the headers in reverse order before delivering the data to the application.]
IP Header
Layout (32-bit words; bit offsets 0, 4, 8, 16, 19, 31):
Word 1: Version (bits 0-3) | HLen (4-7) | TOS (8-15) | Length (16-31)
Word 2: Identification (0-15) | Flags (16-18) | Fragment offset (19-31)
Word 3: TTL (0-7) | Protocol (8-15) | Header checksum (16-31)
Word 4: Source address
Word 5: Destination address
Options (variable)
Payload
The first five words (20 bytes) are the fixed part of the header.
Comments
- HLen: header length, in 32-bit words only (5 <= HLen <= 15)
- TOS (Type of Service): Differentiated Service (6 bits), Explicit Congestion Notification (ECN) (2 bits)
- Length: the length of the entire datagram, header + data
- Flags: Don't Fragment (DF) and More Fragments (MF)
- Protocol: identifies the transport protocol
- Header checksum: uses 1's complement
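The 1's complement header checksum can be illustrated with a short sketch. The function name `internet_checksum` and the sample header bytes are illustrative assumptions, not part of the lecture; the algorithm is the standard 16-bit one's-complement sum, complemented at the end.

```python
import struct

def internet_checksum(header: bytes) -> int:
    """One's-complement sum of the 16-bit words in `header`, then complemented.

    The checksum field itself must be zero when computing a new checksum.
    """
    if len(header) % 2:
        header += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

# Hypothetical 20-byte header with the checksum field (bytes 10-11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)

# A receiver recomputes the sum over the header including the checksum;
# a result of zero means no error was detected.
filled = header[:10] + struct.pack("!H", checksum) + header[12:]
verified = internet_checksum(filled) == 0
```

Because the sum covers only the header, a router has to recompute it whenever it changes a header field such as the TTL.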
Fragmentation
What happens if a router has to forward an IP packet that is larger than allowed by a data link layer?
Break the IP packet into smaller IP packets and provide a way to reassemble
- set the More Fragments bit in all fragments but the last
- set the fragment offset of each fragment to its offset (in 8-byte units) from the beginning of the original packet
- set the packet length to be the length of this fragment
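The three rules above can be sketched as a small function. The name `fragment`, the fixed 20-byte header, and the example sizes are assumptions for illustration; options and the Don't Fragment bit are ignored.

```python
def fragment(data_len: int, mtu: int, header_len: int = 20):
    """Split `data_len` payload bytes into fragments that fit in `mtu`.

    Returns a list of (fragment_offset, total_length, more_fragments) tuples,
    where fragment_offset is in 8-byte units. Assumes mtu >= header_len + 8.
    """
    # Every fragment except the last must carry a multiple of 8 data bytes,
    # because the offset field counts 8-byte units.
    max_data = (mtu - header_len) // 8 * 8
    fragments = []
    pos = 0
    while pos < data_len:
        size = min(max_data, data_len - pos)
        more_fragments = pos + size < data_len
        fragments.append((pos // 8, header_len + size, more_fragments))
        pos += size
    return fragments

# Hypothetical example: 1400 data bytes crossing a link with a 532-byte MTU.
frags = fragment(1400, 532)
# frags == [(0, 532, True), (64, 532, True), (128, 396, False)]
```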
Fragmentation Issues
Sending host must change the IP ID for each packet it sends
Lose one fragment, lose them all
Reassembly is complex
- requires per-packet state
Only reassemble at destination
Fragmentation can be avoided using Path Maximum Transmission Unit (PMTU) Discovery
- most TCP implementations use PMTU discovery
UDP Header
Layout (32-bit words; bit offsets 0, 16, 31):
Word 1: Source port (bits 0-15) | Destination port (16-31)
Word 2: UDP length (0-15) | UDP checksum (16-31)
Payload (variable)
Source and destination ports use the port address space
UDP length is the UDP packet length (including UDP header and payload, but not IP header)
Optional UDP checksum is over the UDP packet
- why have a UDP checksum in addition to the IP checksum?
- why not have just the UDP checksum?
- why is the UDP checksum optional?
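As a rough illustration of the header layout, the sketch below packs and parses the four 16-bit fields with Python's `struct` module. The helper names `build_udp` and `parse_udp` and the port numbers are assumptions; the checksum is left at zero, which in IPv4 means it was not computed.

```python
import struct

# Four unsigned 16-bit fields in network byte order:
# source port, destination port, UDP length, UDP checksum.
UDP_HEADER = struct.Struct("!HHHH")

def build_udp(src_port: int, dst_port: int, payload: bytes, checksum: int = 0) -> bytes:
    """Prepend a UDP header; the length covers header + payload, not the IP header."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload

def parse_udp(datagram: bytes):
    """Return (src_port, dst_port, length, checksum, payload)."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    return src_port, dst_port, length, checksum, datagram[UDP_HEADER.size:length]

# Hypothetical datagram from port 5000 to port 53.
datagram = build_udp(5000, 53, b"hello")
fields = parse_udp(datagram)
```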
Port Addressing
Need to decide which application gets which packets
Solution: map each socket to a port
Client must know the server's port
Separate 16-bit port address spaces for UDP and TCP
- (src IP, src port, dst IP, dst port) uniquely identifies a TCP connection
Well-known ports (0-1023): everyone agrees which services run on these ports
- e.g., ssh: 22, http: 80
- on UNIX, must be root to gain access to these ports (why?)
Ephemeral ports (most of 1024-65535): given to clients
- e.g., chatclient gets one of these
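A minimal sketch of demultiplexing, assuming a host keeps one table of established TCP connections keyed by the 4-tuple and one table of listening ports. The function name `demux`, the dictionary-based packet, and the socket labels are hypothetical.

```python
def demux(packet: dict, connections: dict, listeners: dict):
    """Pick the socket that should receive `packet`.

    connections: (src_ip, src_port, dst_ip, dst_port) -> socket label
    listeners:   dst_port -> socket label for a listening server
    Returns None when no socket matches.
    """
    key = (packet["src_ip"], packet["src_port"], packet["dst_ip"], packet["dst_port"])
    if key in connections:
        return connections[key]  # established connection: exact 4-tuple match
    return listeners.get(packet["dst_port"])  # otherwise fall back to a listener

# Two clients talk to the same server port; the 4-tuple keeps them apart.
connections = {
    ("10.0.0.5", 40001, "10.0.0.1", 80): "http-conn-A",
    ("10.0.0.6", 40001, "10.0.0.1", 80): "http-conn-B",
}
listeners = {80: "http-listener", 22: "ssh-listener"}

pkt_a = {"src_ip": "10.0.0.5", "src_port": 40001, "dst_ip": "10.0.0.1", "dst_port": 80}
pkt_new = {"src_ip": "10.0.0.7", "src_port": 51000, "dst_ip": "10.0.0.1", "dst_port": 22}
```

Here `demux(pkt_a, connections, listeners)` selects the established connection, while `pkt_new` matches no connection and is handed to the ssh listener.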
TCP Header
Layout (32-bit words; bit offsets 0, 4, 10, 16, 31):
Word 1: Source port (bits 0-15) | Destination port (16-31)
Word 2: Sequence number
Word 3: Acknowledgement
Word 4: HdrLen (0-3) | reserved (4-9) | Flags (10-15) | Advertised window (16-31)
Word 5: Checksum (0-15) | Urgent pointer (16-31)
Options (variable)
Payload (variable)
Sequence number, acknowledgement, and advertised window are used by sliding-window based flow control
Flags:
- SYN, FIN: establishing/terminating a TCP connection
- ACK: set when the Acknowledgement field is valid
- URG: urgent data; Urgent Pointer says where non-urgent data starts
- PUSH: don't wait to fill segment
- RESET: abort connection
TCP Challenges
how to provide reliable, in-order, and at-most-once delivery? (sliding window)
need to synchronize sender and receiver (connection establishment)
- e.g., exchange initial sequence numbers
prevent sender from sending too fast for receiver (flow control)
estimate RTT for flow control and timeouts
how to initially decide on sending rate (slow start)
estimate how much bandwidth is available in network (congestion avoidance)
slow down sending rate when we were sending too fast (congestion control)
Connection Establishment: How it works
Three-way handshake
- Goal: agree on a set of parameters: the start sequence number for each side
- Starting sequence numbers are random.
Client (initiator), Active Open: connect()
Server, Passive Open: listen(), accept()
1. Client to server: SYN, SeqNum = x
2. Server to client: SYN and ACK, SeqNum = y and Ack = x + 1
3. Client to server: ACK, Ack = y + 1
Server: allocate buffer space
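The exchange of sequence numbers can be traced with a toy model. This is only a sketch: the function name `three_way_handshake` and the dictionary message format are assumptions, no packets are sent, and real stacks choose initial sequence numbers more carefully than a plain random draw.

```python
import random

SEQ_SPACE = 2**32  # TCP sequence numbers are 32 bits and wrap around

def three_way_handshake(rng=random):
    """Return the three handshake messages as dictionaries."""
    x = rng.randrange(SEQ_SPACE)  # client's random starting sequence number
    y = rng.randrange(SEQ_SPACE)  # server's random starting sequence number

    syn = {"flags": {"SYN"}, "seq": x}
    syn_ack = {"flags": {"SYN", "ACK"}, "seq": y, "ack": (syn["seq"] + 1) % SEQ_SPACE}
    ack = {"flags": {"ACK"}, "ack": (syn_ack["seq"] + 1) % SEQ_SPACE}
    return [syn, syn_ack, ack]

# A seeded generator makes the trace repeatable.
syn, syn_ack, ack = three_way_handshake(random.Random(122))
```

Each side acknowledges the other's starting number plus one, so after the third message both ends know both starting sequence numbers.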
Three-way Handshake: Rationale
Three-way handshake adds 1 RTT delay
Why not just start sending data immediately?
- congestion control
  - network could be congested
  - SYN = 40 bytes, Data < 1500 bytes
  - packets which are dropped at a link waste the bandwidth of all previous links
  - smaller packets waste less bandwidth
  - SYN acts as a cheap probe of network conditions
More Rationale
- protection from denial of service (1)
  - attacker could use one host to fake many source IP addresses (spoofing) and send many SYNs to server
  - server must devote resources (e.g., buffer space) to open connections
  - server would run out of resources and become very slow or crash
  - 3-way handshake requires client to reply before server allocates significant resources
- protection from denial of service (2)
  - if client and server began each connection with a well-known sequence number instead of a random one, an attacker could guess the sequence number and insert bogus packets into the stream
Even More Rationale
- protection from delayed packets
  - client connects to server twice in succession using the same port
  - a packet from the first connection is delayed and arrives during the second connection
  - if sequence numbers are close, old packet could be accepted
Flow control: Window Size and Throughput
Sliding-window based flow control:
- Higher window means higher throughput
  - Throughput = wnd/rtt
- Remember: window size controls throughput
How to determine effective window size?
How to detect packet loss?
[Figure: sender/receiver timeline with wnd = 3. Segments 1, 2, and 3 are sent within one RTT (Round Trip Time); ACK 1, ACK 2, and ACK 3 return and segments 4, 5, and 6 follow.]
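A worked instance of Throughput = wnd/rtt, as a small sketch. The function names and the numbers (a 65535-byte window, a 100 ms RTT, a 10 Mbit/s target) are illustrative assumptions.

```python
def throughput_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Throughput = wnd / rtt, converted from bytes to bits per second."""
    return window_bytes * 8 / rtt_seconds

def window_needed_bytes(rate_bps: float, rtt_seconds: float) -> float:
    """Invert the formula: the window required to sustain `rate_bps`."""
    return rate_bps * rtt_seconds / 8

# A 65535-byte window over a 100 ms round trip gives about 5.24 Mbit/s.
rate = throughput_bps(65535, 0.100)

# Sustaining 10 Mbit/s over the same path needs roughly 125000 bytes in flight.
window = window_needed_bytes(10_000_000, 0.100)
```

The second result shows why a fixed maximum window limits throughput on long paths: doubling the RTT halves the achievable rate.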
Effective Window Size
Receiver window (MaxRcvBuffer: maximum buffer size at receiver)
AdvertisedWindow = MaxRcvBuffer - (LastByteRcvd - LastByteRead)
Sender window (MaxSendBuffer: maximum buffer size at sender)
EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)
MaxSendBuffer >= LastByteWritten - LastByteAcked
[Figure: send and receive buffers, sequence numbers increasing left to right. Sender side (sending application, MaxSendBuffer): LastByteAcked, LastByteSent, LastByteWritten. Receiver side (receiving application, MaxRcvBuffer): LastByteRead, NextByteExpected, LastByteRcvd.]
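The two window formulas translate directly into code. The sketch below uses the slide's variable names in snake case; the byte counts in the example are made up for illustration.

```python
def advertised_window(max_rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    """Free space left in the receive buffer."""
    return max_rcv_buffer - (last_byte_rcvd - last_byte_read)

def effective_window(advertised: int, last_byte_sent: int, last_byte_acked: int) -> int:
    """How much more the sender may put in flight right now."""
    return advertised - (last_byte_sent - last_byte_acked)

# Receiver: 4096-byte buffer, 3000 bytes received, application has read 1000.
adv = advertised_window(4096, last_byte_rcvd=3000, last_byte_read=1000)

# Sender: 500 bytes are in flight (sent but not yet acknowledged).
eff = effective_window(adv, last_byte_sent=3500, last_byte_acked=3000)
```

With these numbers the receiver advertises 2096 bytes, of which 500 are already in flight, so the sender may send 1596 more.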
Advertised Window = 0
Sender cannot send any data
- receiver will not send acks
- receiver cannot notify sender that advertised window has grown
Solution: TCP Persist Timer
- when sender gets advertised window == 0, it sets timer
- if sender receives advertised window > 0, cancels timer
- when timer expires, sender sends 1 byte payload to receiver
  - receiver must accept a segment with 1 byte of data past the window
- receiver sends ack for the byte before the 1-byte probe
- sender gets new advertised window
Silly Window Syndrome (SWS)
[Figure: sender/receiver timeline with Maximum Segment Size (MSS) = w. Labels: app: send 1, app: send w+1; segments of size 1, w-1, and 1; advwin = w; app: read 1, app: read w-1.]
App sends small segments and/or receiver advertises a small window
- causes small packets to be sent in network
- small packets have high header overhead
SWS Solution
Sender only sends if
- no unacknowledged data (Nagle's algorithm), or
- full packet to send
Receiver only sends new advertised window if
- newadvwin - oldadvwin > min(mss, 0.5*maxRcvBuf)
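Both rules can be written as simple predicates. This is a sketch under simplifying assumptions: the function names are hypothetical, and the sender test ignores the advertised window, which a real sender would also have to respect.

```python
def sender_may_send(unacked_bytes: int, queued_bytes: int, mss: int) -> bool:
    """Send only if nothing is unacknowledged (Nagle) or a full packet is ready."""
    if queued_bytes == 0:
        return False  # nothing to send at all
    return unacked_bytes == 0 or queued_bytes >= mss

def receiver_should_advertise(new_advwin: int, old_advwin: int,
                              mss: int, max_rcv_buf: int) -> bool:
    """Announce a larger window only when it has grown by a worthwhile amount."""
    return new_advwin - old_advwin > min(mss, 0.5 * max_rcv_buf)

# One queued byte is held back while earlier data is still unacknowledged...
held = sender_may_send(unacked_bytes=100, queued_bytes=1, mss=1460)
# ...but goes out immediately on an idle connection.
sent = sender_may_send(unacked_bytes=0, queued_bytes=1, mss=1460)

# A window that grew by only 1 byte is not worth announcing.
quiet = receiver_should_advertise(new_advwin=1, old_advwin=0, mss=1460, max_rcv_buf=8192)
```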
Set timeout
If an ack hasn't been received by the timeout, retransmit the packet after the last acked packet
How to set timeout?
- Too long: connection has low throughput
- Too short: retransmit packet that was just delayed
  - packet was probably delayed because of congestion
  - sending another packet too soon just makes congestion worse
Solution: make timeout proportional to RTT
RTT Estimation
Use exponential averaging:
SampleRTT = AckRcvdTime - SendSegmentTime
EstimatedRTT = α × EstimatedRTT + (1 - α) × SampleRTT
TimeOut = 2 × EstimatedRTT
0 < α ≤ 1
[Figure: EstimatedRTT and SampleRTT plotted against time]
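One step of the exponential average, worked numerically. The function names and the choice α = 0.875 are assumptions for illustration; the lecture only requires α to lie in the stated range.

```python
def update_estimated_rtt(estimated_rtt: float, sample_rtt: float, alpha: float = 0.875) -> float:
    """EstimatedRTT = alpha * EstimatedRTT + (1 - alpha) * SampleRTT."""
    return alpha * estimated_rtt + (1 - alpha) * sample_rtt

def timeout(estimated_rtt: float) -> float:
    """TimeOut = 2 * EstimatedRTT."""
    return 2 * estimated_rtt

# Current estimate 100 ms; a new sample of 200 ms arrives.
estimate = update_estimated_rtt(100.0, 200.0)  # 0.875*100 + 0.125*200 = 112.5
rto = timeout(estimate)                        # 225.0
```

A large α makes the estimate stable but slow to follow real changes in the RTT; a small α tracks changes quickly but also follows noise.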
Problem
How to differentiate between the ACK of the original transmission and the ACK of the retransmitted packet?
[Figure: two sender/receiver timelines, each showing an original transmission, a retransmission, and an ACK. In one, SampleRTT is measured from the original transmission to the ACK; in the other, from the retransmission to the ACK.]
Karn/Partridge Algorithm
Measure SampleRTT only for original transmissions
Exponential backoff: for each retransmission, double EstimatedRTT (and therefore TimeOut)
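The two rules can be sketched as a small estimator class. The class name `KarnPartridge`, its method names, and α = 0.875 are assumptions; the sketch follows the slide literally by doubling EstimatedRTT, with TimeOut taken as 2 × EstimatedRTT.

```python
class KarnPartridge:
    """Toy RTT estimator that ignores ambiguous samples and backs off on loss."""

    def __init__(self, estimated_rtt: float, alpha: float = 0.875):
        self.estimated_rtt = estimated_rtt
        self.alpha = alpha

    @property
    def timeout(self) -> float:
        return 2 * self.estimated_rtt

    def on_ack(self, sample_rtt: float, was_retransmitted: bool) -> None:
        if was_retransmitted:
            return  # cannot tell which transmission this ACK answers: skip it
        self.estimated_rtt = (self.alpha * self.estimated_rtt
                              + (1 - self.alpha) * sample_rtt)

    def on_retransmit(self) -> None:
        self.estimated_rtt *= 2  # exponential backoff

est = KarnPartridge(estimated_rtt=100.0)
est.on_retransmit()                        # estimate 200, timeout 400
est.on_ack(50.0, was_retransmitted=True)   # ignored, estimate stays 200
est.on_ack(100.0, was_retransmitted=False) # 0.875*200 + 0.125*100 = 187.5
```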
Jacobson/Karels Algorithm
Problem: exponential average is not enough
- one solution: use standard deviation (requires expensive square root computation)
- use mean deviation instead
Difference = SampleRTT - EstimatedRTT
EstimatedRTT = EstimatedRTT + δ × Difference
Deviation = Deviation + δ × (|Difference| - Deviation)
TimeOut = µ × EstimatedRTT + φ × Deviation
0 < δ ≤ 1, µ = 1, φ = 4
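One update step of these formulas, as a sketch. The function name `jacobson_karels`, the value δ = 0.125, and the starting numbers are assumptions for illustration; µ = 1 and φ = 4 come from the slide.

```python
def jacobson_karels(estimated_rtt: float, deviation: float, sample_rtt: float,
                    delta: float = 0.125, mu: float = 1, phi: float = 4):
    """Return the updated (EstimatedRTT, Deviation, TimeOut) after one sample."""
    difference = sample_rtt - estimated_rtt
    estimated_rtt = estimated_rtt + delta * difference
    deviation = deviation + delta * (abs(difference) - deviation)
    timeout = mu * estimated_rtt + phi * deviation
    return estimated_rtt, deviation, timeout

# Estimate 100 ms, deviation 10 ms, and a new sample of 140 ms:
#   Difference   = 40
#   EstimatedRTT = 100 + 0.125 * 40        = 105
#   Deviation    = 10 + 0.125 * (40 - 10)  = 13.75
#   TimeOut      = 1 * 105 + 4 * 13.75     = 160
new_rtt, new_dev, new_timeout = jacobson_karels(100.0, 10.0, 140.0)
```

When samples vary little, Deviation shrinks and TimeOut stays close to EstimatedRTT; when they vary a lot, the deviation term dominates and the timeout becomes more conservative.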
Summary
IP
- routing, fragmentation
UDP
- multiplexing/demultiplexing using ports
- error detection
TCP
- reliable, in-order, at-most-once delivery
- connection establishment: three-way handshake
- RTT: exponential averaging and variance
- flow control based on sliding window protocol
- congestion control: next lecture