Networked Systems and Services, Fall 2018 Chapter 3 Jussi Kangasharju Markku Kojo Lea Kutvonen
4. Transport Layer Reliability with TCP
Transmission Control Protocol (TCP): RFC 793 plus more than a hundred other RFCs
TCP loss recovery mechanisms (not exhaustive):
- Timer (RTO) Recovery & TCP Reno [RFC 5681]
- TCP NewReno [RFC 3782, RFC 6582]
- Limited Transmit [RFC 3042]
- TCP SACK-based Loss Recovery [RFC 2018, RFC 6675]
2
Remember the Protocol Stack? End-to-end Argument?
[Figure: protocol stacks of two end hosts (User, Application, Transport, Network, Link, Physical) connected through an intermediate node (Network, Link, Physical)]
- What is the right place to implement reliability?
- Transport is the lowest-level end-to-end protocol (in theory)
Transport Layer
[Figure: layer stack: Application, Presentation, Session, Transport, Network, Data Link, Physical]
Function:
- Demultiplexing (port numbers)
Optional functions (TCP provides):
- Creating long-lived connections
- Reliable, in-order packet delivery
- Error detection
- Flow and congestion control
Key challenges:
- Efficient data delivery in the presence of losses
- Detecting and responding to congestion
- Balancing fairness against high utilization
4
TCP Transport Service to Applications
- Connections, 3 phases of communication
  - Connection establishment, data transfer, connection termination
- Bidirectional byte stream
  - TCP does not provide messages ==> applications have to take care of message boundaries! (cf. Unix pipes)
- Reliable transport
  - No data loss or corruption, data delivered in order, no duplication of data
- Flow control
- Congestion control (with approximate fairness)
5
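TCP Segment Format
Each row below is one 32-bit word (bit positions 0 to 31):
  Source port (16 bits) | Destination port (16 bits)
  Sequence number (32 bits)
  Acknowledgement number (32 bits)
  TCP header length (4 bits) | Reserved (6 bits) | URG ACK PSH RST SYN FIN (6 flag bits) | Window (16 bits)
  Checksum (16 bits) | Urgent pointer (16 bits)
  Options (0 or more 32-bit words) (padding)
  Payload (optional)
6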
TCP Options
- Options field for optional features
- Option space is limited
  - The TCP header length field (4 bits) gives the length of the header in 32-bit words => maximum header length 15 * 4 bytes = 60 bytes
  - 20 bytes for the fixed header => max. 40 bytes for options
- Option layout: Option type (1 byte) | Option length (1 byte) | Option value (length - 2 bytes)
7
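The type/length/value layout above can be walked with a short loop. The following is a minimal sketch, not a reference parser: the function name `parse_tcp_options` is hypothetical, and it assumes the two single-byte options defined in RFC 793 (kind 0, End of Option List, and kind 1, No-Operation), which carry no length byte.

```python
def parse_tcp_options(data: bytes):
    """Split a TCP options field into (kind, value) pairs.

    Assumption: kinds 0 (End of Option List) and 1 (No-Operation) are
    single-byte options; every other option is type, length, value.
    """
    options = []
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == 0:          # End of Option List: stop parsing
            break
        if kind == 1:          # No-Operation: padding, no length byte
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option")
        length = data[i + 1]   # covers the type and length bytes too
        if length < 2 or i + length > len(data):
            raise ValueError("bad option length")
        options.append((kind, data[i + 2:i + length]))
        i += length
    return options
```

For example, an options field holding an MSS option (type 2, length 4), two padding bytes, and a SACK-permitted option (type 4, length 2) yields two entries, the second with an empty value.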
Connection Setup
- Why do we need connection setup?
  - To establish state on both hosts
- Most important state: sequence numbers
  - Count the number of bytes that have been sent and received
  - Initial value chosen at random
  - Why?
[Figure: connection setup exchange between Client and Server; isn = initial sequence number]
8
Data Transfer & Sequence Number Space
- TCP uses a byte stream abstraction
  - Each byte in each stream is (implicitly) numbered
  - 32-bit value, wraps around
- Byte stream is broken down into segments (packets)
  - Size limited by the Maximum Segment Size (MSS)
  - MSS is set to limit IP fragmentation
  - MSS is based on the local link MTU size and the receiver MSS negotiated during connection setup (MSS option), OR on Path MTU Discovery
- Each segment has a sequence number
[Figure: byte stream divided into Segment 8, Segment 9 and Segment 10, with sequence numbers 13450, 14950, 16050 and 17550 marking the segment boundaries]
9
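Because the 32-bit sequence number wraps around, plain integer comparison gives wrong answers near the wrap point. The sketch below shows one common way to do modular arithmetic on sequence numbers; the helper names `seq_add` and `seq_lt` are hypothetical and not taken from the slides.

```python
SEQ_MOD = 2 ** 32   # sequence numbers are 32-bit values
HALF = 2 ** 31

def seq_add(seq: int, nbytes: int) -> int:
    """Advance a sequence number by nbytes, wrapping at 2**32."""
    return (seq + nbytes) % SEQ_MOD

def seq_lt(a: int, b: int) -> bool:
    """True if a comes before b in sequence space.

    The difference is taken modulo 2**32; if it falls in the upper
    half of the range, a is treated as being behind b.
    """
    return ((a - b) % SEQ_MOD) >= HALF
```

With these helpers, a segment starting at 13450 that carries 1500 bytes is followed by sequence number 14950, and a sequence number just below 2**32 still compares as earlier than a small number reached after the wrap.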
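TCP Connection Termination
[Figure: Client/Server timeline of connection termination. Both sides Close; one side stays in Timed wait for 2*MSL before reaching Closed, the other reaches Closed directly.]
10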
How are TCP Acks Generated?
- Acknowledgement number indicates the sequence number that the receiver expects to receive next
  - Highest sequence number received in order + 1
- Acknowledgements are cumulative
  - ACK number k implies ACK of all sequence numbers < k
- Delayed ACKs
  - Receiver does not need to acknowledge each segment separately
  - At least every second (full-sized) segment should be acknowledged
  - Sending an ACK is delayed at most 500 msecs if the next segment in order has not arrived
    - Many implementations use a delayed ACK timer of 200 msecs
- Out-of-order segments are acknowledged immediately!
  - Send ACK for the highest sequence number received in order
11
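The cumulative ACK rule can be illustrated with a small receiver model. This is a simplified sketch under stated assumptions: the class `CumulativeAckReceiver` is hypothetical, it ignores sequence number wraparound, delayed ACKs and overlapping segments, and it returns the ACK number for every arriving segment.

```python
class CumulativeAckReceiver:
    """Tracks the next expected sequence number (the ACK number)."""

    def __init__(self, rcv_nxt: int):
        self.rcv_nxt = rcv_nxt   # next sequence number expected in order
        self.buffered = {}       # out-of-order segments: start seq -> length

    def on_segment(self, seq: int, length: int) -> int:
        if seq == self.rcv_nxt:
            # In-order data: advance, then absorb any buffered segments
            # that have now become contiguous.
            self.rcv_nxt += length
            while self.rcv_nxt in self.buffered:
                self.rcv_nxt += self.buffered.pop(self.rcv_nxt)
        elif seq > self.rcv_nxt:
            # Out-of-order data: keep it, ACK number stays the same,
            # so the sender sees a duplicate ACK.
            self.buffered[seq] = length
        return self.rcv_nxt
```

With 1-byte segments, receiving segments 1, 3, 4 and then 2 produces ACK numbers 2, 2, 2 and 5: the two middle ACKs are duplicates, and the last one cumulatively acknowledges everything that was buffered.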
TCP Congestion Control
- TCP congestion control is one of the most important functions for ensuring stable operation of the Internet
  - When routers become congested, they have to drop packets
- Congestion control is intertwined with loss recovery and thereby with TCP performance
- Congestion window, cwnd, controls how much unacknowledged data can be in flight in the network (FlightSize)
  - Largest allowed sequence number when sending = highest acknowledged sequence number + cwnd
- In the following, the congestion window and other congestion control details are presented, but we focus on loss recovery
  - In the Internet Protocol course we focus on congestion control
12
Slow Start
- Slow Start is used when the network state is considered unknown
  - At the beginning of the TCP connection (Initial Slow Start)
  - After the retransmission timer expires (in RTO Recovery)
  - When there has not been anything to send for a while (Restart After Idle)
- Basic idea
  - Increase cwnd until segment loss is detected (= network becomes congested) OR cwnd reaches the Slow Start Threshold (ssthresh)
  - cwnd is increased by at most one MSS per arriving new acknowledgement (an ACK that acknowledges new data)
    - cwnd gets roughly doubled per Round-Trip Time (RTT)
      * Exponential growth as a function of RTT
13
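The per-ACK increase can be traced with a few lines of code. This is an idealised sketch, assuming no loss, no delayed ACKs (one ACK per segment), and cwnd counted in whole MSS units; the function `slow_start_trace` is a hypothetical helper.

```python
def slow_start_trace(initial_cwnd: int, ssthresh: int, rtts: int):
    """Return cwnd (in MSS) at the start and after each RTT of Slow Start."""
    cwnd = initial_cwnd
    trace = [cwnd]
    for _ in range(rtts):
        acks = cwnd                  # one new ACK per segment in flight
        for _ in range(acks):
            if cwnd >= ssthresh:     # Slow Start ends at ssthresh
                break
            cwnd += 1                # +1 MSS per new ACK
        trace.append(cwnd)
    return trace
```

Starting from 1 MSS with ssthresh = 16 MSS, the trace doubles every RTT (1, 2, 4, 8, 16) and then stops growing, since growth beyond ssthresh is left to Congestion Avoidance, which is not modelled here.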
Slow Start
- cwnd starts from 1 MSS for RTO loss recovery
- cwnd can be larger for Initial Slow Start and Restart After Idle
[Figure: Sender/Receiver timeline of data segments and ACKs over RTTs 1 to 3, with cwnd = 1 MSS, then 2 MSS, then 3 MSS and 4 MSS]
- Slow Start also effectively diverts RTO Recovery away from go-back-n
14
Retransmission Timeout (RTO) Recovery
- Retransmission timer is set when a data segment is sent
- When the retransmission timer expires, it starts the RTO loss recovery with congestion control actions:
  - cwnd = 1 MSS; ssthresh = max(FlightSize / 2, 2 * SMSS) (*)
  - Retransmit the first unacknowledged segment
  - Continue (re)transmission in Slow Start until cwnd > ssthresh, after which enter Congestion Avoidance
    - Each new ACK indicates the next sequence number (segment) to retransmit
    - When there are no more segments to retransmit, continue by transmitting new data
(*) FlightSize: the amount of unacknowledged data a TCP sender has in flight
ssthresh (Slow-Start Threshold): used to indicate the previously observed safe sending rate
15
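The congestion control actions taken when the timer expires amount to two assignments. The sketch below only restates that arithmetic; the function name `on_rto_expiry` is hypothetical, and all quantities are in bytes.

```python
def on_rto_expiry(flight_size: int, smss: int):
    """Return (cwnd, ssthresh) after the retransmission timer expires.

    flight_size: unacknowledged bytes in flight when the timer fires.
    smss: sender maximum segment size in bytes.
    """
    ssthresh = max(flight_size // 2, 2 * smss)
    cwnd = smss   # restart from one segment, in Slow Start
    return cwnd, ssthresh
```

With MSS = 1 byte and 4 segments in flight this gives cwnd = 1 and ssthresh = 2. With ten 1460-byte segments in flight it gives cwnd = 1460 and ssthresh = 7300.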
RTO Recovery
[Figure: Sender/Receiver timeline, MSS = 1 B (byte) for simplicity. Assume TCP cwnd = 4 MSS. The receiver returns Ack=2 three times; the retransmission timer (RTO) expires and the sender enters RTO loss recovery with ssthresh = 2, cwnd = 1, then cwnd = 2, ...]
16
Retransmission timer [RFC 6298]
- Retransmission timer runs for the first unacknowledged segment
- Important to find a proper value for the retransmission timer:
  - Too big a timer value: start of the loss recovery is delayed
  - Too small a timer value: timer expires spuriously
    - Results in unnecessary retransmissions; in the worst case a full window of data is unnecessarily retransmitted!
    - Results also in an unnecessary congestion response (transmission rate decreased)
- Initial RTO value >= 1 sec (recently changed from 3 to 1 sec)
- After this, a proper value is estimated (computed) dynamically from the measured Round-Trip Time (RTT)
17
Calculating RTO timer value
- RTT is measured continuously when ACKs arrive
- TCP sender calculates a weighted moving average, to be used as the smoothed RTT: SRTT
  - SRTT is updated each time an RTT sample is measured (at least once per window, i.e., once per RTT)
  - SRTT = (1-α)*SRTT + α*RTTsample, where α = 1/8 = 0.125
- Calculates also the RTT variation, RTTvar
  - RTTvar = (1-β)*RTTvar + β*|RTTsample - SRTT|, where β = 1/4 = 0.25
- Timer value: RTO = SRTT + 4*RTTvar
18
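The two moving averages and the resulting timer value can be sketched as a small estimator class. Some details here are assumptions that go beyond the slide: the class name `RtoEstimator` is hypothetical; following RFC 6298, the first sample initialises SRTT to the sample and RTTvar to half of it, and RTTvar is updated before SRTT so that it uses the previous SRTT. Clock granularity and the minimum and maximum RTO bounds are left out.

```python
class RtoEstimator:
    ALPHA = 1 / 8   # weight of a new sample in SRTT
    BETA = 1 / 4    # weight of a new sample in RTTvar

    def __init__(self, initial_rto: float = 1.0):
        self.srtt = None
        self.rttvar = None
        self.rto = initial_rto   # initial RTO of 1 second

    def on_rtt_sample(self, sample: float) -> float:
        """Update SRTT, RTTvar and RTO from one RTT sample (seconds)."""
        if self.srtt is None:
            # First measurement (RFC 6298 initialisation).
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # RTTvar first, using the SRTT from before this sample.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(sample - self.srtt))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample
        self.rto = self.srtt + 4 * self.rttvar
        return self.rto
```

For a first sample of 100 ms the estimator yields SRTT = 100 ms, RTTvar = 50 ms and RTO = 300 ms. A second sample of 200 ms moves SRTT to 112.5 ms and RTTvar to 62.5 ms, so the RTO becomes 362.5 ms.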
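RTT Sample Ambiguity
[Figure: two timelines in which a segment is retransmitted after an RTO, each marking a candidate RTT sample with a question mark]
- What is the RTT of a retransmitted segment?
19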
Accurate measurement of RTO
- Solution to acknowledgement ambiguity: Karn's algorithm
  - Don't update the RTT estimate for retransmitted segments
- How often are RTT samples measured?
  - Some implementations take only one sample per window (= one per RTT)
    - RFC 6298 requires at least one per RTT
  - Many newer implementations, e.g., Linux, measure RTT for each valid segment
- In practice only one retransmission timer is running at a time
  - For the first unacknowledged segment
  - Timer is restarted each time an ACK that acknowledges new data arrives ==> effective timer value = RTO + 1 RTT
20
Fast Retransmit
- Duplicate ACK (dupack)
  - When an out-of-order data segment arrives at a TCP receiver, the receiver immediately acknowledges, with a pure ACK, the highest sequence number received in order (i.e., the same acknowledgement number as in the ACK that acknowledged the last segment received in order)
- Receiving dupacks indicates that
  - Segments are leaving the network
  - A segment has been received out of order, and what the expected sequence number is
- After receiving 3 consecutive dupacks, the TCP sender Fast Retransmits the first unacknowledged segment
  - [ Sets also cwnd & ssthresh (see steps 1&2 on slide 23) ]
- After the Fast Retransmit, the sender continues in Fast Recovery
21
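The three-dupack trigger can be sketched as a small counter on the sender side. This is a deliberately simplified illustration: the class `DupAckCounter` and its return strings are hypothetical, wraparound is ignored, and a dupack is identified only by its acknowledgement number, whereas RFC 5681 adds further conditions (for example, that the ACK carries no data and that data is outstanding).

```python
class DupAckCounter:
    """Counts consecutive duplicate ACKs and signals Fast Retransmit."""

    DUP_THRESH = 3

    def __init__(self, snd_una: int):
        self.snd_una = snd_una   # first unacknowledged sequence number
        self.dupacks = 0

    def on_ack(self, ack: int) -> str:
        if ack > self.snd_una:
            # ACK for new data: advance and reset the dupack count.
            self.snd_una = ack
            self.dupacks = 0
            return "new_ack"
        if ack == self.snd_una:
            self.dupacks += 1
            if self.dupacks == self.DUP_THRESH:
                return "fast_retransmit"   # resend segment at snd_una
            return "dupack"
        return "old_ack"
```

A sender that receives ack=2 four times in a row sees one new ACK, two dupacks, and then the Fast Retransmit signal on the third dupack.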
NewReno Fast Recovery [RFC 6582 (old: RFC 3782 *)]
- Fast Recovery allows transmission of new data during loss recovery
- NewReno Fast Recovery is able to recover one lost segment per RTT
* See RFC 3782 for a possibly easier-to-understand description
22
Fast Recovery (NewReno)
- The "recover" variable is used to determine when recovery is over (and to avoid multiple false Fast Retransmits)
- Fast Retransmit & Fast Recovery (NewReno), triggered by the 3rd dupack:
  1. Set recover = highest sequence number transmitted so far [ Set ssthresh = max(FlightSize / 2, 2*MSS) ]
  2. Retransmit the first unacknowledged segment and [ set cwnd = ssthresh + 3*MSS ]
  3. For each additional duplicate ACK received while in Fast Recovery, increment cwnd by one MSS to potentially allow transmitting a new segment
  4. Transmit a new segment, if allowed by the new value of cwnd (and rwnd)
  5. When an ACK arrives that acknowledges new data,
     a) If this ACK acknowledges all of the data up to and including "recover", then recovery is completed;
     b) Otherwise, the acknowledgement is a Partial ACK and recovery should continue (see next slide)
- Step 3 allows transmitting new data also during loss recovery
23
NewReno/Step 5 b): Partial ACK
- On each Partial ACK
  - Retransmit the first unacknowledged segment
  - [ Deflate cwnd by the amount of new data acknowledged by the Partial ACK. If the Partial ACK acknowledges at least one MSS of new data, then add back MSS bytes to cwnd ]
  - Transmit a new segment, if allowed by the new value of cwnd
- Continue Fast Recovery
  - Repeat steps 3&4 on arrival of a dupack
  - Repeat step 5 on arrival of an ACK that acknowledges new data
24
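The cwnd bookkeeping of steps 1 to 5 can be sketched as a small state holder. The class `NewRenoSender` and its method names are hypothetical, wraparound is ignored, and segment transmission itself is not modelled. One assumption goes beyond the steps listed: on leaving recovery the sketch sets cwnd = ssthresh, which is one of the options RFC 6582 allows.

```python
from dataclasses import dataclass

@dataclass
class NewRenoSender:
    mss: int
    cwnd: int
    snd_una: int              # first unacknowledged sequence number
    ssthresh: int = 0
    recover: int = 0
    in_recovery: bool = False

    def on_third_dupack(self, flight_size: int, highest_sent: int) -> int:
        """Steps 1 and 2; returns the sequence number to retransmit."""
        self.recover = highest_sent
        self.ssthresh = max(flight_size // 2, 2 * self.mss)
        self.cwnd = self.ssthresh + 3 * self.mss
        self.in_recovery = True
        return self.snd_una

    def on_dupack(self) -> None:
        """Step 3: inflate cwnd for each additional dupack."""
        if self.in_recovery:
            self.cwnd += self.mss

    def on_new_ack(self, ack: int):
        """Step 5; returns a sequence number to retransmit, or None."""
        acked = ack - self.snd_una
        self.snd_una = ack
        if not self.in_recovery:
            return None
        if ack > self.recover:
            # 5 a) everything up to and including recover is ACKed.
            self.cwnd = self.ssthresh   # assumption: one RFC 6582 option
            self.in_recovery = False
            return None
        # 5 b) Partial ACK: deflate, add back one MSS, retransmit.
        self.cwnd -= acked
        if acked >= self.mss:
            self.cwnd += self.mss
        return self.snd_una
```

With MSS = 1, cwnd = 6, six segments in flight and highest sequence number 7, entering recovery gives ssthresh = 3 and cwnd = 6. Two Partial ACKs that each acknowledge two segments bring cwnd to 5 and then 4, a further dupack raises it to 5, and the ACK covering "recover" ends recovery with cwnd = 3.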
Fast Retransmit & Fast Recovery (NewReno)
[Figure: Sender/Receiver timeline, MSS = 1 B for simplicity. cwnd values at the sender, in order: cwnd = 6; on entering recovery ssthresh = 3, cwnd = 3+3 = 6, Recover = 7; cwnd = 6-2+1 = 5; cwnd = 5-2+1 = 4; cwnd = 4+1 = 5; Recovery Done, cwnd = 3]
25
Limited Transmit [RFC 3042]
- Problem: if cwnd is small (cwnd < 4), OR several segments are dropped in a single window
  ==> It is possible that the TCP sender cannot receive three dupacks
  ==> The TCP sender has to wait for a retransmission timeout and recover using Slow Start (with a drastic cwnd reduction)
  - This delays the start of recovery and is inefficient
- Solution: Limited Transmit
  - Transmit a new data segment on each of the first two dupacks
  - Transmitting new data segments can be allowed, as a dupack indicates that a segment has left the network
  - New data segments trigger more dupacks
26
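The decision to send on the first two dupacks can be written as a single predicate. The function name `limited_transmit_allowed` is hypothetical. The two extra conditions are taken from RFC 3042 rather than from the slide: the receiver's advertised window must allow the segment, and the outstanding data must stay within cwnd plus two segments.

```python
def limited_transmit_allowed(dupacks: int, flight_size: int,
                             cwnd: int, smss: int, rwnd: int) -> bool:
    """True if one new segment may be sent on this dupack.

    dupacks: number of consecutive dupacks received so far (1, 2, ...).
    flight_size, cwnd, smss, rwnd: all in bytes.
    """
    if dupacks not in (1, 2):
        return False                 # the 3rd dupack triggers Fast Retransmit
    after_send = flight_size + smss
    return after_send <= cwnd + 2 * smss and after_send <= rwnd
```

With MSS = 1 and cwnd = 3 fully used, the first dupack allows a fourth segment and the second dupack a fifth; those two new segments can then generate the extra dupacks needed to reach three.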
Limited Transmit
[Figure: Sender/Receiver timeline, MSS = 1 B for simplicity, cwnd = 3, with ACKs Ack=2 and Ack=3. There is no need to wait for the RTO, as three dupacks arrive and trigger Fast Retransmit before the timer expires.]
27
TCP Selective Acknowledgements [RFC 2018, RFC 6675]
- Duplicate ACKs indicate only one missing segment (next expected)
- Similarly, each cumulative ACK during recovery (i.e., a NewReno Partial ACK) indicates only one missing segment (next expected)
  ==> NewReno Fast Recovery can recover only one segment per RTT
- Also, in RTO recovery several segments are often unnecessarily retransmitted
- The Selective Acknowledgement (SACK) option allows identifying several missing segments with a single dupack
28
SACK option (RFC 2018)
- TCP SACK-permitted option
  - Layout: type = 4 (1 byte) | length = 2 (1 byte)
  - Used in connection establishment (with SYN segments) to negotiate the use of the SACK option
- TCP SACK option
  - Carries information about sequence number ranges (SACK blocks) that have arrived successfully, but out of order, at the receiver (stored in the receive buffer)
29
TCP SACK option
Layout:
  type = 5 | length = n
  Beginning of the 1st block (seq.no)
  End of the 1st block (seq.no + 1)
  Beginning of the 2nd block (seq.no)
  End of the 2nd block (seq.no + 1)
  Beginning of the 3rd block (seq.no)
  End of the 3rd block (seq.no + 1)
- One TCP segment may carry at most 4 SACK blocks, as at most 40 bytes have been reserved for TCP options (use of other TCP options reduces this).
30
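The limit of 4 blocks follows from the option layout: 2 bytes for type and length plus 8 bytes (two 32-bit sequence numbers) per block. The sketch below encodes and decodes the option with Python's `struct` module; the function names are hypothetical, and block edges are given as (beginning, end + 1) pairs as in the layout above.

```python
import struct

MAX_OPTION_BYTES = 40   # option space left after the 20-byte fixed header
SACK_KIND = 5

def max_sack_blocks(other_option_bytes: int = 0) -> int:
    """How many 8-byte SACK blocks fit next to the other options."""
    return (MAX_OPTION_BYTES - other_option_bytes - 2) // 8

def encode_sack(blocks) -> bytes:
    """blocks: list of (left_edge, right_edge) sequence number pairs."""
    out = struct.pack("!BB", SACK_KIND, 2 + 8 * len(blocks))
    for left, right in blocks:
        out += struct.pack("!II", left, right)   # network byte order
    return out

def decode_sack(data: bytes):
    kind, length = struct.unpack_from("!BB", data, 0)
    if kind != SACK_KIND or (length - 2) % 8 != 0 or length > len(data):
        raise ValueError("not a well-formed SACK option")
    return [struct.unpack_from("!II", data, off)
            for off in range(2, length, 8)]
```

With the full 40 bytes available, (40 - 2) // 8 gives 4 blocks. If another option occupies, say, 12 bytes, only 3 blocks fit.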
Sending SACK option
- Always when acknowledging an out-of-order segment (i.e., always when acknowledging other than the highest sequence number that has arrived)
- The SACK option includes as many of the latest sequence number ranges as possible
  - Each arrived segment (block) is reported several times (i.e., repeated in later ACKs)
- The first block in the SACK option includes the segment that triggered the acknowledgement
- SACK information is only informative for the TCP sender
  - The TCP sender must not remove a segment from its send buffer until a cumulative ACK acknowledging it arrives
31
SACK-based Recovery [RFC 6675]
- With the help of the SACK option a TCP sender may recover more than one lost segment within one RTT (cf. NewReno)
- The TCP sender maintains a scoreboard data structure with the retransmission queue (updated on arrival of an ACK and after transmitting a segment)
  - SACKed: information on whether a SACK block corresponding to the segment has been received
  - HighACK: sequence number of the highest byte of data that has been cumulatively ACKed
  - HighRxt: highest sequence number that has been retransmitted during the current loss recovery phase
  - HighData: highest sequence number transmitted
  - pipe: an estimate of the number of bytes (segments) outstanding in the network
- cwnd limits transmission of segments during loss recovery; if cwnd - pipe >= 1 SMSS, the sender can transmit segments
  - If there are segments that are considered lost, retransmit as many lost segments as cwnd allows
    - A segment is considered lost if at least 3 discontiguous SACKed sequences have arrived above the segment, or more than 2 * SMSS bytes with sequence numbers above the segment have been SACKed
  - If there are not enough lost segments to transmit, transmit as many new data segments as cwnd allows
  - If there are neither lost nor new segments to transmit, follow the rules in Steps (3) & (4) of NextSeg() in RFC 6675 to retransmit one data segment not considered lost
32
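The loss rule and the pipe estimate can be sketched over a simplified scoreboard. Several things here are assumptions made for illustration: the function names `is_sacked`, `is_lost` and `set_pipe` are hypothetical (loosely modelled on the IsLost() and SetPipe() routines of RFC 6675), all segments are full-sized, SACK blocks are merged, non-overlapping (start, end + 1) pairs, and wraparound is ignored.

```python
DUP_THRESH = 3

def is_sacked(seq, sacked, smss):
    """True if the segment starting at seq is covered by a SACK block."""
    return any(start <= seq and seq + smss <= end for start, end in sacked)

def is_lost(seq, sacked, smss):
    """Loss rule: enough SACKed data lies above the segment at seq."""
    above = [(s, e) for s, e in sacked if s > seq]
    bytes_above = sum(e - s for s, e in above)
    return (len(above) >= DUP_THRESH
            or bytes_above > (DUP_THRESH - 1) * smss)

def set_pipe(high_ack, high_data, high_rxt, sacked, smss):
    """Estimate the number of bytes outstanding in the network."""
    pipe = 0
    for seq in range(high_ack + 1, high_data + 1, smss):
        if is_sacked(seq, sacked, smss):
            continue                    # arrived at the receiver
        if not is_lost(seq, sacked, smss):
            pipe += smss                # original copy may still be in flight
        if seq <= high_rxt:
            pipe += smss                # retransmitted copy is in flight
    return pipe
```

As a usage example with MSS = 1: if segment 1 is cumulatively ACKed, segments 2 to 7 have been sent, and segments 3, 5 and 7 are SACKed, then segment 2 is considered lost while 4 and 6 are not, so pipe = 2. After segment 2 is retransmitted (HighRxt = 2), pipe = 3.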
SACK Fast Retransmit (RFC 6675)
If at least 3 segments above HighACK+1 have been SACKed (*):
  1. Set RecoveryPoint = HighData
  2. [ Set ssthresh = cwnd = FlightSize / 2 ]
  3. Retransmit the first unacknowledged segment and set HighRxt = highest sequence number in the retransmitted segment
  4. Recalculate a new value for pipe:
     - Includes all data (segments) that have been sent but not ACKed (either cumulatively or SACKed), but not segments that are considered lost (= at least 3 later segments after the segment have reached the receiver and have been SACKed)
     - Includes all retransmitted data (segments) (HighACK < seqno <= HighRxt)
  5. If cwnd - pipe >= 1 SMSS, the sender can transmit segments
     - First retransmit lost segments, then transmit new data
     - As many as allowed by cwnd
     - If there are neither lost segments nor new data, send one segment as per Steps (3) & (4) of NextSeg()
     - After transmitting, update HighRxt, HighData and pipe
(*) On each ACK with SACKed data, use Limited Transmit to send at most one SMSS of new data (if cwnd - pipe >= 1 SMSS)
33
SACK Fast Recovery (cont'd)
On each arriving ACK:
  A. If cumulative ACK number > RecoveryPoint
     - Recovery completed, exit Fast Recovery
  B. If cumulative ACK number <= RecoveryPoint
     - Update scoreboard with SACK information
     - Update pipe (as in step 4 of SACK Fast Retransmit)
  C. If cwnd - pipe >= 1 SMSS, the sender can transmit segments
     - First retransmit lost segments, then transmit new data
     - As many as allowed by cwnd
     - If there are neither lost segments nor new data, send one segment as per Steps (3) & (4) of NextSeg()
     - After transmitting, update HighRxt, HighData and pipe
34
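The transmission rule in step C (lost segments first, then new data, as long as cwnd - pipe >= 1 SMSS) can be sketched as a planning loop. The function `plan_transmissions` is a hypothetical helper; it takes the lost and new segments as already-computed lists and leaves out the fallback of Steps (3) & (4) of NextSeg().

```python
def plan_transmissions(cwnd, pipe, lost, new, smss):
    """Decide what to send while cwnd - pipe >= 1 SMSS.

    lost: starting sequence numbers of segments considered lost, in order.
    new:  starting sequence numbers of unsent new segments, in order.
    Returns (plan, pipe) where plan is a list of ("rtx" | "new", seq).
    """
    lost, new = list(lost), list(new)
    plan = []
    while cwnd - pipe >= smss:
        if lost:
            plan.append(("rtx", lost.pop(0)))   # retransmissions come first
        elif new:
            plan.append(("new", new.pop(0)))    # then new data
        else:
            break                               # NextSeg() fallback omitted
        pipe += smss                            # the sent segment is in flight
    return plan, pipe
```

With MSS = 1, cwnd = 3 and pipe = 2, only one segment fits, so a lost segment 2 is retransmitted and pipe becomes 3. With pipe = 1 there is room for two segments: the lost one first, then one new segment.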
Fast Retransmit & Fast Recovery (SACK)
[Figure: Sender/Receiver timeline, MSS = 1 B for simplicity, cwnd = 6 before the loss. ACKs shown: ack=2; ack=2, SACK 3; ack=2, SACK 3, 5; ack=2, SACK 3, 5, 7; ack=4, SACK 5, 7; ack=4, SACK 5, 7-8; ack=6, SACK 7-8; ack=6, SACK 7-9; ack=6, SACK 7-10; ack=11; ack=12. On entering recovery the sender sets RecoveryPoint = 7, ssthresh = 3, cwnd = 3; pipe = 2 rises to 3 after the retransmission, and pipe then varies between 1 and 3 until Recovery Done.]
35