Networked Systems and Services, Fall 2017 Reliability with TCP Jussi Kangasharju Markku Kojo Lea Kutvonen
4. Transmission Control Protocol (TCP)
- Specified in RFC 793 and more than a hundred other RFCs
- TCP loss recovery mechanisms (not exhaustive):
  - Timer (RTO) Recovery & TCP Reno [RFC 5681]
  - TCP NewReno [RFC 3782, RFC 6582]
  - Limited Transmit [RFC 3042]
  - TCP SACK-based Loss Recovery [RFC 2018, RFC 6675]
Remember the Protocol Stack? End-to-end Argument?
- What is the right place to implement reliability?
- Transport is the lowest-level end-to-end protocol (in theory)
[Figure: two end hosts, each with User, Application, Transport, Network, Link and Physical layers, connected through an intermediate node that has only Network, Link and Physical layers]
Transport Layer
[Figure: layer stack: Application, Presentation, Session, Transport, Network, Data Link, Physical]
- Function: demultiplexing (port numbers)
- Optional functions (TCP provides):
  - Creating long-lived connections
  - Reliable, in-order packet delivery
  - Error detection
  - Flow and congestion control
- Key challenges:
  - Efficient data delivery in the presence of losses
  - Detecting and responding to congestion
  - Balancing fairness against high utilization
TCP Transport Service to Applications
- Connections with 3 phases of communication: connection establishment, data transfer, connection termination
- Bidirectional byte stream
  - TCP does not provide messages ==> applications have to take care of message boundaries! (cf. Unix pipes)
- Reliable transport: no data loss or corruption, data delivered in order, no duplication of data
- Flow control
- Congestion control (with approximate fairness)
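TCP Segment Format
Bit positions marked along each 32-bit row: 0, 4, 10, 16, 24, 31
- Source port (16 bits) | Destination port (16 bits)
- Sequence number (32 bits)
- Acknowledgement number (32 bits)
- TCP header length (4 bits) | Reserved (6 bits) | Flags URG, ACK, PSH, RST, SYN, FIN (6 bits) | Window (16 bits)
- Checksum (16 bits) | Urgent pointer (16 bits)
- Options (0 or more 32-bit words, with padding)
- Payload (optional)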
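The fixed header layout can be illustrated with a minimal parsing sketch. The function name `parse_tcp_header` and the dictionary keys are illustrative choices, not part of the lecture material; the field widths follow the RFC 793 layout, and the sketch does no checksum verification.

```python
import struct

FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]  # bit 0 .. bit 5

def parse_tcp_header(segment: bytes) -> dict:
    """Unpack the 20-byte fixed TCP header and split off options and payload."""
    (src, dst, seq, ack, off_flags, window, checksum, urgent) = struct.unpack(
        "!HHIIHHHH", segment[:20])
    header_len = (off_flags >> 12) * 4   # 4-bit length field counts 32-bit words
    flag_bits = off_flags & 0x3F         # low 6 bits hold the flags
    return {
        "src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
        "header_len": header_len,
        "flags": {name for i, name in enumerate(FLAG_NAMES) if flag_bits & (1 << i)},
        "window": window, "checksum": checksum, "urgent": urgent,
        "options": segment[20:header_len],
        "payload": segment[header_len:],
    }
```

A SYN segment with no options has a header length field of 5, so `header_len` comes out as 20 bytes and the options slice is empty.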
TCP Options
- Options field for optional features
- Option space is limited:
  - The TCP header length field (= 4 bits) indicates the length of the header in 32-bit words => max header length 15*4 bytes = 60 bytes
  - 20 bytes are taken by the fixed header => max. 40 bytes for options
- Option format: Option type (1 byte) | Option length (1 byte) | Option value (length - 2 bytes)
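The type/length/value layout can be walked with a short loop. This is a sketch only: `parse_tcp_options` is a hypothetical helper name, and the special handling of kinds 0 (End of Option List) and 1 (No-Operation), which are single bytes without a length field, comes from RFC 793 rather than from the slide.

```python
def parse_tcp_options(options: bytes) -> list:
    """Walk a TCP options field and return a list of (kind, value) pairs."""
    result = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:            # End of Option List: stop parsing
            break
        if kind == 1:            # No-Operation: one byte, used for padding
            i += 1
            continue
        if i + 1 >= len(options):
            raise ValueError("truncated option")
        length = options[i + 1]  # covers type byte, length byte and value
        if length < 2 or i + length > len(options):
            raise ValueError("bad option length")
        result.append((kind, options[i + 2:i + length]))
        i += length
    return result
```

For example, an MSS option (kind 2, length 4) carries a 2-byte value, while SACK-permitted (kind 4, length 2) carries an empty value.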
Connection Setup
- Why do we need connection setup? To establish state on both hosts
- Most important state: sequence numbers
  - Count the number of bytes that have been sent and received
  - Initial value chosen at random. Why?
[Figure: connection setup message exchange between Client and Server; isn = initial sequence number]
Data Transfer & Sequence Number Space
- TCP uses a byte stream abstraction
  - Each byte in each stream is (implicitly) numbered
  - 32-bit value, wraps around
  - Initial, random values selected during setup
- Byte stream is broken down into segments (packets)
  - Size limited by the Maximum Segment Size (MSS)
  - MSS is set to limit IP fragmentation
  - MSS is based on the local link MTU size and the receiver MSS negotiated during connection setup (MSS option), OR on Path MTU Discovery
- Each segment has a sequence number
[Figure: byte stream cut into Segment 8, Segment 9 and Segment 10, with boundaries at sequence numbers 13450, 14950, 16050 and 17550]
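Because the sequence number is a 32-bit value that wraps around, sequence arithmetic has to be done modulo 2^32. The sketch below shows one way to do this; the helper names are hypothetical, and the comparison rule (two numbers are ordered by the shorter way around the circle) is a common convention assumed here, not something stated on the slide.

```python
SEQ_MOD = 2 ** 32

def seq_add(seq: int, nbytes: int) -> int:
    """Advance a sequence number by nbytes, wrapping at 2**32."""
    return (seq + nbytes) % SEQ_MOD

def seq_lt(a: int, b: int) -> bool:
    """True if a precedes b, assuming the two are less than 2**31 apart."""
    return a != b and (b - a) % SEQ_MOD < 2 ** 31

def segment_stream(first_seq: int, data_len: int, mss: int):
    """Yield (sequence number, length) for each segment of a byte stream."""
    seq, remaining = first_seq, data_len
    while remaining > 0:
        size = min(mss, remaining)
        yield seq, size
        seq = seq_add(seq, size)
        remaining -= size
```

With a starting sequence number 1000 bytes below the wrap point, 3000 bytes of data and an MSS of 1460, the second segment already starts at sequence number 460.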
TCP Connection Termination
[Figure: time-sequence diagram of connection termination between Client and Server, showing Close on both sides, a Timed wait of 2*MSL, and the Closed state]
How are TCP ACKs Generated?
- The acknowledgement number indicates the sequence number that the receiver expects to receive next
  - Highest sequence number received in order + 1
- Acknowledgements are cumulative
  - ACK number k acknowledges all sequence numbers < k
- Delayed ACKs
  - The receiver does not need to acknowledge each segment separately
  - At least every second (full-sized) segment should be acknowledged
  - Sending an ACK is delayed at most 500 msec if the next in-order segment has not arrived
    - Many implementations use a delayed ACK timer value of 200 msec
- Out-of-order segments are acknowledged immediately!
  - Send an ACK for the highest sequence number received in order
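Cumulative acknowledgement can be sketched with a toy receiver that numbers whole segments (MSS = 1 unit, as in the lecture examples). The class name `Receiver` is illustrative, and delayed ACKs are deliberately left out: every arriving segment produces an ACK.

```python
class Receiver:
    """Toy receiver that returns the cumulative ACK number for each arrival."""

    def __init__(self, first_expected: int = 1):
        self.expected = first_expected   # next in-order segment number
        self.out_of_order = set()        # buffered segments above a gap

    def receive(self, seq: int) -> int:
        if seq == self.expected:
            self.expected += 1
            # Buffered segments may now be in order; advance past them.
            while self.expected in self.out_of_order:
                self.out_of_order.remove(self.expected)
                self.expected += 1
        elif seq > self.expected:
            self.out_of_order.add(seq)   # gap: the ACK number stays the same
        return self.expected
```

When segment 2 is delayed behind segments 3 and 4, the receiver repeats ACK 2 twice (duplicate ACKs) and then jumps to ACK 5 once the gap is filled.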
TCP Congestion Control
- TCP congestion control is one of the most important functions for ensuring the operation of the Internet
  - When routers become congested, they have to drop packets
- Congestion control is intertwined with loss recovery and thereby with TCP performance
- The congestion window, cwnd, controls how much unacknowledged data can be in flight in the network (FlightSize)
  - Largest allowed sequence number when sending = highest acknowledged sequence number + cwnd
TCP self-clocking (ack-clocking)
[Figure: TCP sender, network and TCP receiver; data segments 1-8 travel towards the receiver while ACKs return; the sender side marks the first unacknowledged segment, the next segment to send, and cwnd]
- An ACK arrives ==> a data segment has left the network, so the TCP sender can send more data (the same amount that left the network)
- Acknowledgements control the transmission rate (ack clocking) and direct how cwnd is updated
Slow Start
- Slow Start is used when the network state is considered unknown
  - At the beginning of the TCP connection (Initial Slow Start)
  - After the retransmission timer expires (in RTO Recovery)
  - When there has not been anything to send for a while (Restart After Idle)
- Purpose of Slow Start
  - Start ACK clocking
  - Determine the available network capacity
- Basic idea
  - Increase cwnd until a segment loss is detected (= network becomes congested) OR cwnd reaches the Slow Start Threshold (ssthresh)
  - cwnd is increased by at most one MSS per arriving new acknowledgement (an ACK that acknowledges new data)
    - cwnd roughly doubles every Round-Trip Time (RTT): exponential growth as a function of RTT
Slow Start
- cwnd starts from 1 MSS
  - cwnd can be larger for Initial Slow Start and Restart After Idle
[Figure: time-sequence diagram of data segments and ACKs between Sender and Receiver over the 1st, 2nd and 3rd RTT, with the sender's cwnd growing from 1 SMSS to 2 SMSS, 3 SMSS and 4 SMSS]
- Slow Start also effectively diverts RTO Recovery away from go-back-n
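The exponential growth per RTT can be traced with a few lines of code. This is a simplified sketch under stated assumptions: cwnd is counted in whole segments, every segment is acknowledged individually (no delayed ACKs), nothing is lost, and the function name `slow_start_trace` is made up for this example.

```python
def slow_start_trace(ssthresh_segments: int, initial_cwnd: int = 1) -> list:
    """Return cwnd (in segments) at the start of each RTT during Slow Start."""
    cwnd = initial_cwnd
    history = [cwnd]
    while cwnd < ssthresh_segments:
        new_acks = cwnd  # one new ACK per segment sent in this round
        # Each new ACK adds one segment to cwnd; stop growing at ssthresh.
        cwnd = min(cwnd + new_acks, ssthresh_segments)
        history.append(cwnd)
    return history
```

Starting from one segment with ssthresh at 16 segments, the window goes 1, 2, 4, 8, 16, i.e. it reaches ssthresh after four round trips.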
Retransmission Timeout (RTO) Recovery
- The retransmission timer is set when a data segment is sent
- When the retransmission timer expires, the sender starts RTO loss recovery with congestion control actions:
  - cwnd = 1 MSS; ssthresh = max(FlightSize/2, 2*SMSS) (*)
  - Retransmit the first unacknowledged segment
  - Continue (re)transmission in Slow Start until cwnd > ssthresh, after which enter Congestion Avoidance
    - Each new ACK indicates the next sequence number (segment) to retransmit
    - When there are no more segments to retransmit, continue by transmitting new data
- (*) FlightSize = the amount of unacknowledged data a TCP sender has in flight
- ssthresh (Slow Start Threshold) is used to indicate the previously observed safe sending rate
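RTO Recovery (example)
For simplicity: MSS = 1 B (byte); assume TCP cwnd = 4 MSS
[Figure: time-sequence diagram between Sender and Receiver; the ACKs shown carry Ack=2; the retransmission timer (RTO) expires and the sender enters RTO loss recovery with ssthresh = 2 and cwnd = 1; cwnd then grows to 2 and recovery is done]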
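The congestion control actions taken at timer expiry are small enough to write down directly. The sketch below uses hypothetical function names and covers only the window bookkeeping, not the retransmission itself; Congestion Avoidance growth is outside its scope.

```python
def on_rto_expiry(flight_size: int, smss: int) -> tuple:
    """Return (cwnd, ssthresh) after the retransmission timer expires."""
    ssthresh = max(flight_size // 2, 2 * smss)
    cwnd = smss                      # restart from one segment
    return cwnd, ssthresh

def slow_start_ack(cwnd: int, ssthresh: int, smss: int) -> int:
    """Grow cwnd by one SMSS per new ACK until cwnd exceeds ssthresh."""
    if cwnd <= ssthresh:
        return cwnd + smss
    return cwnd                      # Congestion Avoidance is not modelled here
```

With MSS = 1 B and four segments in flight, expiry gives ssthresh = 2 and cwnd = 1, and the first new ACK raises cwnd to 2, which matches the values in the RTO Recovery example.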
Retransmission timer [RFC 6298]
- The retransmission timer runs for the first unacknowledged segment
- It is important to find a proper value for the retransmission timer:
  - Too large a timer value: the start of loss recovery is delayed
  - Too small a timer value: the timer expires spuriously
    - Results in unnecessary retransmissions; in the worst case a full window of data is unnecessarily retransmitted!
    - Also results in an unnecessary congestion response (transmission rate decreased)
- Initial RTO value >= 1 sec (recently changed from 3 to 1 sec)
  - After this, a proper value is estimated (computed) dynamically from the measured Round-Trip Time (RTT)
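Calculating RTO timer value
- RTT is measured continuously when ACKs arrive
- The TCP sender calculates a weighted moving average, to be used as the smoothed RTT: SRTT
  - SRTT is updated each time an RTT sample is measured (at least once per window, i.e., once per RTT)
  - $\mathrm{SRTT} = (1 - \alpha) \cdot \mathrm{SRTT} + \alpha \cdot \mathrm{RTT}_{\mathrm{sample}}$, where $\alpha = 1/8 = 0.125$
- The sender also calculates the RTT variation, RTTvar
  - $\mathrm{RTTvar} = (1 - \beta) \cdot \mathrm{RTTvar} + \beta \cdot |\mathrm{RTT}_{\mathrm{sample}} - \mathrm{SRTT}|$, where $\beta = 1/4 = 0.25$
- Timer value: $\mathrm{RTO} = \mathrm{SRTT} + 4 \cdot \mathrm{RTTvar}$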
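The estimator can be sketched as a small class. Some details here are taken from RFC 6298 rather than from the slide and should be read as assumptions of the sketch: the first sample initialises SRTT to the sample and RTTvar to half of it, and RTTvar is updated before SRTT so that it uses the previous SRTT value. Clock granularity and the minimum RTO are left out; the class name `RtoEstimator` is illustrative.

```python
class RtoEstimator:
    """Smoothed RTT / RTT variation estimator in the style of RFC 6298."""

    ALPHA = 1 / 8
    BETA = 1 / 4

    def __init__(self, initial_rto: float = 1.0):
        self.srtt = None
        self.rttvar = None
        self.rto = initial_rto       # used until the first RTT sample arrives

    def sample(self, rtt: float) -> float:
        """Feed one RTT sample (seconds) and return the new RTO."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTvar first, so it sees SRTT from before this sample.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(rtt - self.srtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = self.srtt + 4 * self.rttvar
        return self.rto
```

A first sample of 100 msec yields an RTO of about 300 msec; a second sample of 200 msec moves SRTT only slightly but raises RTTvar, so the RTO grows to about 362.5 msec.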
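RTT Sample Ambiguity
[Figure: time-sequence diagrams of a segment retransmitted after an RTO, with candidate RTT measurements marked "Sample" and "Sample?"]
- What is the RTT of a retransmitted segment?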
Accurate measurement of RTO
- Solution to acknowledgement ambiguity: Karn's algorithm
  - Don't update the RTT estimate for retransmitted segments
  - After an RTO, use exponential backoff to increase the timeout
  - Keep this increased RTO value until an acknowledgement for a new data segment arrives, allowing a new valid RTT sample
- How often are RTT samples measured?
  - Some implementations take only one sample per window (= one per RTT)
    - RFC 6298 requires at least one sample per RTT
  - Many newer implementations, e.g., Linux, measure the RTT for each valid segment
- In practice only one retransmission timer is running at a time
  - For the first unacknowledged segment
  - The timer is restarted each time an ACK that acknowledges new data arrives ==> effective timer value = RTO + 1 RTT
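Karn's algorithm and the exponential backoff fit into a few lines of timer bookkeeping. This is a sketch with assumed names: `RetransmitTimer`, the `compute_rto` callback (any function mapping an RTT sample to an RTO) and the 60-second cap are illustrative choices, not values taken from the slides.

```python
class RetransmitTimer:
    """RTO bookkeeping with Karn's algorithm and exponential backoff."""

    def __init__(self, rto: float = 1.0, max_rto: float = 60.0):
        self.rto = rto
        self.max_rto = max_rto       # assumed upper bound for the backoff

    def on_timeout(self) -> float:
        """Timer expired: double the RTO (exponential backoff)."""
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto

    def on_ack(self, rtt_sample: float, was_retransmitted: bool, compute_rto) -> float:
        """Handle an ACK for new data; compute_rto maps an RTT sample to an RTO."""
        if was_retransmitted:
            # Ambiguous sample: keep the backed-off RTO, take no measurement.
            return self.rto
        self.rto = compute_rto(rtt_sample)
        return self.rto
```

After two timeouts a 1-second RTO has grown to 4 seconds, and it stays there until an ACK for a segment that was sent only once provides a valid sample.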
Fast Retransmit
- Duplicate ACK (dupack)
  - When an out-of-order data segment arrives at a TCP receiver, the receiver immediately acknowledges, with a pure ACK, the highest sequence number received in order (i.e., the same acknowledgement number as in the ACK that acknowledged the last segment received in order)
- Receiving dupacks indicates that
  - Segments are leaving the network
  - Ack clocking works
  - A segment has been received out of order, and what the expected sequence number is
- After receiving 3 consecutive dupacks, the TCP sender
  - Fast Retransmits the first unacknowledged segment
  - Sets cwnd & ssthresh (see steps 1&2 of Fast Recovery (NewReno))
- After the Fast Retransmit, the sender continues in Fast Recovery
NewReno [RFC 6582 (old: RFC 3782 *)]
- Fast Recovery (Reno) first implemented in 1990
- Improved solution: NewReno
  - Allows recovering more than one lost segment in the same window of data
* See RFC 3782 for a possibly easier-to-understand description
Fast Recovery (NewReno)
- The recover variable is used to determine when recovery is over (and to avoid multiple false Fast Retransmits)
- Fast Retransmit & Fast Recovery (NewReno), triggered by the 3rd dupack:
  1. Set recover = "highest sequence number transmitted so far" and set ssthresh = max(FlightSize / 2, 2*MSS)
  2. Retransmit the first unacknowledged segment and set cwnd = ssthresh + 3*MSS
  3. For each additional duplicate ACK received while in Fast Recovery, increment cwnd by one SMSS
  4. Transmit a new segment, if allowed by the new value of cwnd (and rwnd)
  5. When an ACK arrives that acknowledges new data:
     a) If this ACK acknowledges all of the data up to and including "recover", then recovery is completed;
     b) Otherwise, the acknowledgement is a Partial ACK and recovery should continue (see next slide)
- Step 3 inflates cwnd, which allows transmitting new data (step 4) also during loss recovery
NewReno/Step 5 b): Partial ACK
- On each Partial ACK
  - Retransmit the first unacknowledged segment
  - Deflate cwnd by the amount of new data acknowledged by the Partial ACK. If the Partial ACK acknowledges at least one SMSS of new data, then add back SMSS bytes to cwnd
  - Transmit a new segment, if allowed by the new value of cwnd
- Continue Fast Recovery
  - Repeat steps 3&4 on arrival of a dupack
  - Repeat step 5 on arrival of an ACK that acknowledges new data
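Fast Retransmit & Fast Recovery (NewReno) (example)
For simplicity: MSS = 1 B
[Figure: time-sequence diagram between Sender and Receiver, annotated with the sender state listed here]
- Before the loss is detected: cwnd = 6
- On the 3rd dupack: ssthresh = 3; cwnd = 3+3 = 6; recover = 7
- cwnd = 6-2+1 = 5, FlightSize = 5
- cwnd = 5-2+1 = 4, FlightSize = 4
- cwnd = 4+1 = 5
- Recovery done: cwnd = 3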
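The window arithmetic of the NewReno steps can be replayed in code. The sketch below tracks only cwnd, ssthresh and recover; it does not send or retransmit anything. The class name, the method names and the choice to set cwnd = ssthresh when recovery completes are assumptions of this sketch (the last one matches the final cwnd in the example), and the acknowledgement numbers used in the test are invented to drive the state machine.

```python
class NewRenoSender:
    """Window bookkeeping for NewReno Fast Recovery (no actual transmission)."""

    def __init__(self, cwnd: int, smss: int = 1):
        self.cwnd = cwnd
        self.smss = smss
        self.ssthresh = None
        self.recover = None
        self.in_recovery = False

    def on_third_dupack(self, flight_size: int, highest_sent: int) -> None:
        # Steps 1 and 2: remember recover, halve, then inflate by 3 segments.
        self.recover = highest_sent
        self.ssthresh = max(flight_size // 2, 2 * self.smss)
        self.cwnd = self.ssthresh + 3 * self.smss
        self.in_recovery = True

    def on_dupack(self) -> None:
        # Step 3: each further dupack means one more segment left the network.
        if self.in_recovery:
            self.cwnd += self.smss

    def on_new_ack(self, ack_no: int, newly_acked: int) -> None:
        # Step 5: ack_no is the next expected sequence number.
        if not self.in_recovery:
            return
        if ack_no > self.recover:        # everything up to recover is ACKed
            self.cwnd = self.ssthresh    # deflate the window, leave recovery
            self.in_recovery = False
        else:                            # Partial ACK
            self.cwnd -= newly_acked
            if newly_acked >= self.smss:
                self.cwnd += self.smss
```

Feeding it two Partial ACKs that each acknowledge 2 bytes, one dupack and a final full ACK gives cwnd values 6, 5, 4, 5 and 3, the same sequence as in the example.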
Limited Transmit [RFC 3042]
- Problem: if cwnd is small (cwnd < 4), OR several segments are dropped in a single window
  ==> It is possible that the TCP sender does not receive three dupacks
  ==> The TCP sender has to wait for the retransmission timeout and recover using Slow Start (with a drastic cwnd reduction)
  - This delays the start of recovery and is inefficient
- Solution: Limited Transmit
  - Transmit a new data segment on each of the first two dupacks
  - Transmitting new data segments can be allowed because a dupack indicates that a segment has left the network
  - The new data segments trigger more dupacks
Limited Transmit (example)
For simplicity: MSS = 1 B; cwnd = 3
[Figure: time-sequence diagram between Sender and Receiver with ACKs labelled Ack=2 and Ack=3; no need to wait for the RTO, as three dupacks arrive and trigger Fast Retransmit]
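The sender's decision on each dupack can be sketched as a single function. The name `dupack_action` and its return strings are invented for this example; the two sending conditions (the receiver window allows the segment, and outstanding data stays within cwnd plus two segments) are taken from RFC 3042, not from the slide.

```python
def dupack_action(dupacks: int, flight_size: int, cwnd: int, rwnd: int,
                  smss: int, new_data_available: bool) -> str:
    """Decide what to do after the dupack counter reaches `dupacks`."""
    if dupacks >= 3:
        return "fast_retransmit"
    # Limited Transmit: cwnd itself is not changed, but up to two extra
    # segments beyond cwnd may be outstanding, if rwnd also allows it.
    if new_data_available and flight_size + smss <= min(cwnd + 2 * smss, rwnd):
        return "send_new_segment"
    return "wait"
```

With cwnd = 3 and three segments in flight, the first and second dupack each release one new segment, and the dupacks those segments provoke let the counter reach three.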
TCP Selective Acknowledgements [RFC 2018, RFC 6675]
- Duplicate ACKs indicate only one missing segment (the next expected one)
- Similarly, each cumulative ACK during recovery (i.e., a NewReno Partial ACK) indicates only one missing segment (the next expected one)
  ==> NewReno Fast Recovery can recover only one segment per RTT
  ==> In RTO recovery several segments are often unnecessarily retransmitted
- The Selective Acknowledgement (SACK) option allows identifying several missing segments with a single dupack
SACK option (RFC 2018)
- TCP SACK-permitted option: type = 4 (1 byte), length = 2 (1 byte)
  - Used in connection establishment (with SYN segments) to negotiate the use of the SACK option
- TCP SACK option
  - Carries information about sequence number ranges that have arrived successfully, but out of order, at the receiver (stored in the receive buffer)
TCP SACK option
- Option header: type = 5, length = n
- Beginning of the 1st block (seq.no) | End of the 1st block (seq.no + 1)
- Beginning of the 2nd block (seq.no) | End of the 2nd block (seq.no + 1)
- Beginning of the 3rd block (seq.no) | End of the 3rd block (seq.no + 1)
- One TCP segment may carry at most 4 SACK blocks, as at most 40 bytes are reserved for TCP options (use of other TCP options reduces this).
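The limit of 4 blocks follows from the option layout: 2 bytes of type and length plus 8 bytes per block must fit into 40 bytes. The sketch below makes that arithmetic explicit and encodes a SACK option; the function names are hypothetical, and the 12-byte figure used in the usage note for a padded Timestamps option is an assumption about a typical option mix.

```python
import struct

SACK_KIND = 5
MAX_OPTION_BYTES = 40

def max_sack_blocks(other_option_bytes: int = 0) -> int:
    """Number of 8-byte SACK blocks that fit next to the other options."""
    return (MAX_OPTION_BYTES - other_option_bytes - 2) // 8

def encode_sack_option(blocks, other_option_bytes: int = 0) -> bytes:
    """Encode (begin, end) pairs, where end = last sequence number + 1."""
    if len(blocks) > max_sack_blocks(other_option_bytes):
        raise ValueError("too many SACK blocks for the available option space")
    body = b"".join(struct.pack("!II", begin, end) for begin, end in blocks)
    return struct.pack("!BB", SACK_KIND, 2 + len(body)) + body
```

With no other options, `max_sack_blocks()` gives 4; if 12 bytes are already taken (for example by a padded Timestamps option), only 3 blocks fit.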
Sending SACK option
- Always when acknowledging an out-of-order segment (i.e., always when acknowledging other than the highest sequence number that has arrived)
- The SACK option includes as many of the latest sequence number ranges as possible
  - Each arrived segment (block) is reported several times (i.e., repeated in later ACKs)
  - The first block in the SACK option includes the segment that triggered the acknowledgement
- SACK information is only advisory for the TCP sender
  - The TCP sender must not remove a segment from its send buffer until a cumulative ACK acknowledging it arrives
SACK-based Recovery [RFC 6675]
- With the help of the SACK option a TCP sender may recover more than one lost segment within one RTT (cf. NewReno)
- The TCP sender maintains a scoreboard data structure with the retransmission queue (updated on arrival of an ACK and after transmitting a segment)
  - SACKed: whether a SACK block corresponding to the segment has been received
  - HighACK: sequence number of the highest byte of data that has been cumulatively ACKed
  - HighRxt: highest sequence number that has been retransmitted during the current loss recovery phase
  - HighData: highest sequence number transmitted
  - pipe: an estimate of the number of bytes (segments) outstanding in the network
- cwnd limits transmission of segments during loss recovery: if cwnd - pipe >= 1 SMSS, the sender can transmit segments
  - If there are segments that are considered lost, retransmit as many lost segments as cwnd allows
    - A segment is considered lost if at least 3 discontiguous SACKed sequences have arrived above the segment, or more than 2 * SMSS bytes with sequence numbers above the segment have been SACKed
  - If there are not enough lost segments to transmit, transmit as many new data segments as cwnd allows
  - If there are neither lost nor new segments to transmit, follow the rules in Steps (3) & (4) of NextSeg() in RFC 6675 to retransmit one data segment not considered lost
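The loss test can be sketched for a scoreboard kept at segment granularity (MSS = 1 unit, as in the lecture examples). The function name `is_lost` and the set-based scoreboard are simplifications made for this example; RFC 6675 defines the rule in terms of byte ranges.

```python
DUP_THRESH = 3

def is_lost(seq: int, sacked: set, smss: int = 1) -> bool:
    """Apply the loss rule to segment `seq`, given the SACKed segment numbers."""
    above = sorted(s for s in sacked if s > seq)
    if not above:
        return False
    # Count runs of consecutive SACKed segments above seq.
    runs = 1 + sum(1 for a, b in zip(above, above[1:]) if b != a + 1)
    sacked_bytes = len(above) * smss
    return runs >= DUP_THRESH or sacked_bytes > (DUP_THRESH - 1) * smss
```

With segments 3, 5 and 7 SACKed, segment 2 is considered lost (three separate SACKed runs above it), while segment 4 is not yet (only two SACKed segments above it).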
SACK Fast Retransmit (RFC 6675)
- If at least 3 segments above HighACK+1 have been SACKed (*):
  1. Set RecoveryPoint = HighData
  2. Set ssthresh = cwnd = FlightSize / 2
  3. Retransmit the first unacknowledged segment and set HighRxt = highest sequence number in the retransmitted segment
  4. Recalculate a new value for pipe:
     - Includes all data (segments) that have been sent but not ACKed (either cumulatively or SACKed), but not segments that are considered lost (= at least 3 later segments after the segment have reached the receiver and have been SACKed)
     - Includes all retransmitted data (segments) (HighACK < seqno <= HighRxt)
  5. If cwnd - pipe >= 1 SMSS, the sender can transmit segments
     - First retransmit lost segments, then transmit new data
     - As many as allowed by cwnd
     - If there are no lost segments and no new data, send one segment as per Steps (3) & (4) of NextSeg()
     - After transmitting, update HighRxt, HighData and pipe
- (*) On each ACK with SACKed data, use Limited Transmit to send at most one SMSS of new data (if cwnd - pipe >= 1 SMSS)
SACK Fast Recovery (cont'd)
- On each arriving ACK:
  A. If cumulative ACK number > RecoveryPoint
     - Recovery completed, exit Fast Recovery
  B. If cumulative ACK number <= RecoveryPoint
     - Update the scoreboard with the SACK information
     - Update pipe (as in step 4 of SACK Fast Retransmit)
  C. If cwnd - pipe >= 1 SMSS, the sender can transmit segments
     - First retransmit lost segments, then transmit new data
     - As many as allowed by cwnd
     - If there are no lost segments and no new data, send one segment as per Steps (3) & (4) of NextSeg()
     - After transmitting, update HighRxt, HighData and pipe
Scoreboard (example)
- segments_sent = sent but not cumulatively acknowledged
- X = segment dropped by the network
- S = SACKed (SACK block received for the segment)
[Figure: scoreboard for Seqno 13 down to 1 with per-segment marks S, X and "lost", and pointers HighData, RecoveryPoint, HighRxt and HighACK]
- cwnd = 6; an ACK with SACK for segment 10 arrives:
  - pipe = segments_sent - SACKed - lost + retransmitted = 12 - 7 - 2 + 2 = 5
  - -> Send new data segment (13)
  - Update pipe -> pipe = 6; update HighData -> 13
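The pipe calculation can be sketched over sets of segment numbers. The function name `compute_pipe` is illustrative, and the scoreboard contents in the usage note are an invented layout chosen only to give the same counts as the example (12 sent, 7 SACKed, 2 lost, 2 retransmitted); they are not a reconstruction of the figure.

```python
def compute_pipe(outstanding: set, sacked: set, lost: set, retransmitted: set) -> int:
    """Estimate the number of segments still in the network."""
    pipe = 0
    for seg in outstanding:
        if seg in sacked:
            continue                  # already at the receiver
        if seg not in lost:
            pipe += 1                 # original transmission still in flight
        if seg in retransmitted:
            pipe += 1                 # the retransmission is in flight
    return pipe

def can_send(cwnd: int, pipe: int, smss: int = 1) -> bool:
    """Transmission is allowed while cwnd - pipe >= 1 SMSS."""
    return cwnd - pipe >= smss
```

For segments 1 to 12 outstanding, with segments 2, 3, 5, 6, 7, 8 and 10 SACKed and segments 1 and 4 lost and retransmitted, the estimate is 12 - 7 - 2 + 2 = 5, so with cwnd = 6 exactly one more segment may be sent.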
Fast Retransmit & Fast Recovery (SACK) (example)
For simplicity: MSS = 1 B; cwnd = 6
[Figure: time-sequence diagram between Sender and Receiver]
- ACKs shown in the diagram: ack=2; ack=2, SACK 3; ack=2, SACK 3, 5; ack=2, SACK 3, 5, 7; ack=4, SACK 5, 7; ack=4, SACK 5, 7-8; ack=6, SACK 7-8; ack=6, SACK 7-9; ack=6, SACK 7-10; ack=11; ack=12
- Sender state on entering recovery: RecoveryPoint = 7; ssthresh = 3; cwnd = 3; pipe = 2
- During recovery cwnd = 3 and pipe takes values between 1 and 3 as ACKs arrive and segments are transmitted, until recovery is done
THE END
Email: Markku.Kojo@cs.Helsinki.FI