Networked Systems and Services, Fall 2018 Chapter 2. Jussi Kangasharju Markku Kojo Lea Kutvonen

Outline Physical layer reliability Low level reliability Parities and checksums Cyclic Redundancy Check (CRC) Error detecting and correcting codes Automatic repeat request (ARQ) Network layer reliability Forward Error Correction Network coding

Physical layer reliability Sender, receiver and channel Data sent electrically or optically Sent as bits or combinations of bits Various techniques: Modulation Pulses In the rest, we do not really cover this Does not interact (much) with the higher layers

Link layer basics Unit of interest: A sequence of bits, e.g., Character Longer block of bits Bits might be received correctly or incorrectly Physical layer provides no guarantees Obvious goal: Get the right bits at the receiver How to: Detect? Correct? Recover?

What do we want to achieve? Detect: Receiver is able to determine the presence of error Either an error in general or in specific bits Correct: Receiver is able to correct error(s) on its own No need to contact sender Recover: Receiver must contact sender and ask for retransmission Needed anyway unless correction is feasible Which one is most important? Absolute minimum: Detection and recovery Sometimes recovering too slow → Need correction

Detection basics How can receiver detect an error? Which transmitted sequence is correct: S → 1 0 0 0 0 0 1 → R or S → 1 0 0 1 0 0 1 → R? Not possible to tell unless we send additional information All detection (and correction) mechanisms require sending additional, redundant data

Simple solution Send every bit 3 times Triple-modular redundancy Commonly used in many areas for safety-critical systems (Example: https://en.wikipedia.org/wiki/triple_modular_redundancy) Pros: Can detect and correct 1 bit errors Cons: Lot of overhead, send 3 times the needed bits Not very good with burst errors
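The repetition scheme above can be sketched in a few lines of Python. This is a minimal illustration under the assumption that bits are held in a list of 0/1 integers; the names `tmr_encode` and `tmr_decode` are made up for this example.

```python
def tmr_encode(bits):
    """Send every bit three times."""
    return [b for b in bits for _ in range(3)]

def tmr_decode(coded):
    """Majority vote over each group of three received bits."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded), 3)]

data = [1, 0, 1, 1]
sent = tmr_encode(data)

single = sent.copy()
single[4] ^= 1                 # one flipped bit in the second group
print(tmr_decode(single) == data)

burst = sent.copy()
burst[3] ^= 1                  # two flipped bits in the same group:
burst[4] ^= 1                  # the majority vote now picks the wrong value
print(tmr_decode(burst) == data)
```

The second case shows why plain repetition copes badly with burst errors: two errors inside one group of three outvote the correct copy.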

Interleaving to the rescue! Interleaving means not sending bits in order (Example: https://en.wikipedia.org/wiki/forward_error_correction#interleaving) Pros: More robust against burst errors Cons: Still high overhead Slower to receive data; cannot detect until everything is received Interleaving used in media delivery and video encodings
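A small sketch of the idea, assuming interleaving is combined with the triple repetition from the previous slide: all first copies are sent first, then all second copies, then all third copies. The helper names are hypothetical and the layout is only one of many possible interleavers.

```python
def repeat3(bits):
    """Triple repetition: each bit appears three times in a row."""
    return [b for b in bits for _ in range(3)]

def interleave(coded, depth=3):
    """Reorder so that the copies of one bit are spread far apart."""
    return [coded[i] for d in range(depth) for i in range(d, len(coded), depth)]

def deinterleave(stream, depth=3):
    """Undo interleave(); len(stream) must be a multiple of depth."""
    n = len(stream) // depth
    return [stream[d * n + g] for g in range(n) for d in range(depth)]

def majority(coded):
    """Majority vote over each group of three bits."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded), 3)]

data = [1, 0, 1, 1]

# Burst of three consecutive bit errors on the interleaved stream
stream = interleave(repeat3(data))
for i in (2, 3, 4):
    stream[i] ^= 1
with_interleaving = majority(deinterleave(stream))

# The same burst on the plain (non-interleaved) stream
plain = repeat3(data)
for i in (2, 3, 4):
    plain[i] ^= 1
without_interleaving = majority(plain)

print(with_interleaving == data, without_interleaving == data)
```

Note that the receiver has to wait for the whole interleaved block before it can vote, which is the latency cost mentioned above.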

Checksums Checksum: Mathematical function of the data Added to the transmission (data + checksum) by sender Receiver can calculate same function If match, then good, if not, then error Checksums used in many, many places Bank account numbers, credit card numbers, ID numbers, etc. Goal often: Protection against typos and stupid forgers Algorithms public and known

Parity bit Parity is the simplest form of checksum Adds one bit of redundancy to data Two variants: Even and odd parity Calculation: Calculate number of 1 bits in data Set parity bit to 0 or 1 so that total number of 1 bits is even or odd (depending on which parity is used) Receiver calculates number of 1 bits in data and checks result (Example: https://en.wikipedia.org/wiki/parity_bit)

Parity properties Can detect all single bit errors Actually, can detect all cases with odd number of bit errors Cannot detect an even number of bit errors Errors cancel each other out in calculation Typically used on byte level Cannot correct errors, must retransmit in case error detected Use cases: RAID disks, memory buses, serial data transmission
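The parity calculation and its blind spot can be illustrated as follows. This is a minimal sketch that assumes bits are stored as a list of 0/1 integers; `parity_bit` and `parity_ok` are names invented for the example.

```python
def parity_bit(bits, even=True):
    """Parity bit that makes the total number of 1 bits even (or odd)."""
    ones_are_odd = sum(bits) % 2
    return ones_are_odd if even else ones_are_odd ^ 1

def parity_ok(frame, even=True):
    """Receiver side: recount the 1 bits over data plus parity bit."""
    return sum(frame) % 2 == (0 if even else 1)

data = [1, 0, 1, 1, 0, 0, 1]
frame = data + [parity_bit(data)]

one_error = frame.copy()
one_error[2] ^= 1              # single bit error: detected

two_errors = one_error.copy()
two_errors[5] ^= 1             # second error cancels the first: undetected

print(parity_ok(frame), parity_ok(one_error), parity_ok(two_errors))
```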

Cyclic Redundancy Check (CRC) Burst errors very common in data transmission Cyclic codes protect against them and are fast and easy to calculate Parity is a special case of CRC (1 bit CRC) Generator polynomial of degree n Message divided by generator polynomial Essentially bitwise XOR (fast to calculate) Remainder becomes the checksum (Example, https://en.wikipedia.org/wiki/cyclic_redundancy_check)
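The division by the generator polynomial can be sketched as a bitwise long division using XOR. The code below is an illustrative, unoptimized version; the function names are hypothetical, and the 3-bit generator 1011 (x^3 + x + 1) is chosen only to keep the example small. Real protocols use longer generators such as CRC-32 and table-driven implementations.

```python
def crc_remainder(message_bits, generator_bits):
    """Remainder of message * x^n divided by the generator (mod 2)."""
    n = len(generator_bits) - 1                # degree of the generator
    work = list(message_bits) + [0] * n        # append n zero bits
    for i in range(len(message_bits)):
        if work[i]:                            # leading 1: subtract (XOR) generator
            for j, g in enumerate(generator_bits):
                work[i + j] ^= g
    return work[-n:]

def crc_check(frame_bits, generator_bits):
    """Receiver side: recompute the checksum and compare."""
    n = len(generator_bits) - 1
    return crc_remainder(frame_bits[:-n], generator_bits) == list(frame_bits[-n:])

generator = [1, 0, 1, 1]
message = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
frame = message + crc_remainder(message, generator)

burst = frame.copy()
for i in (5, 6, 7):                            # burst of three flipped bits
    burst[i] ^= 1

print(crc_check(frame, generator), crc_check(burst, generator))

# With the generator x + 1 the CRC reduces to an even parity bit
print(crc_remainder([1, 0, 1, 1, 0, 0, 1], [1, 1]))
```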

CRC use cases Data transmission and storage Ethernet uses CRC-32 Various CRCs used in mobile networks gzip and bzip2 use CRC-32 (same as Ethernet) iSCSI uses CRC-32C (different polynomial from Ethernet) Also used in train communication and aviation Only detection, no correction of errors

Error correction Detection is mandatory, but needs recovery to proceed How about also correcting detected errors? Pros: No need to contact sender again, saves one RTT Cons: Needs more redundancy → Reduced goodput Error correction on low level here Forward Error Correction (FEC) across packets comes later

Hamming codes How to indicate which bit was in error? Repetition codes (e.g., triple redundancy) not very efficient Basic idea: We have n bits in message (data bits plus check bits) We need k error correction bits, such that 2^k ≥ n + 1 (the k check bits must distinguish n possible error positions plus the no-error case) Then we can determine which bit was in error (single bit error) So called (n, n-k) code Code rate is (n-k)/n (= goodput), e.g., 4/7 for Hamming (7, 4) Set of all possible sent data Remember: Low level communication, maybe 7 bits?

Hamming distance Metric to determine difference between two strings Must be of equal length For us: Received data and correct data Distance is the number of substitutions needed to go from received data to correct data How many errors have happened? Set of possible original data (the codewords) has some minimum Hamming distance Minimum distance 2: Can detect single bit errors Minimum distance 3: Can correct single bit errors
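As a small illustration, the distance between two strings and the minimum distance of a set of codewords can be computed as below. The function names are made up for this sketch, and the two example codes (even parity over two data bits, and triple repetition of one bit) are chosen because they appear earlier in this chapter.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of positions in which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must be of equal length")
    return sum(x != y for x, y in zip(a, b))

def min_distance(codewords):
    """Smallest distance between any two distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

print(hamming_distance("1000001", "1001001"))

even_parity_code = ["000", "011", "101", "110"]   # 2 data bits + parity bit
repetition_code = ["000", "111"]                  # 1 data bit sent 3 times

print(min_distance(even_parity_code))   # enough to detect a single bit error
print(min_distance(repetition_code))    # enough to correct a single bit error
```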

Hamming (7, 4) We send data items of 4 bits Need 3 bits for error correction Why? Construction of code: 1. Number bits in binary, starting from 1 2. Power of 2 bit positions are parity bits 3. All others are data bits 4. Each data bit covered by at least 2 parity bits as follows: Parity bit 1 covers all bits with LSB = 1 Parity bit 2 covers all bits with second-lsb = 1 Parity bit 3 covers all bits with third-lsb = 1, and so on

Hamming (7, 4) construction example
Bit number:  1  2  3  4  5  6  7
Data/Parity: p1 p2 d1 p4 d2 d3 d4
P1 covers bits 1, 3, 5, 7
P2 covers bits 2, 3, 6, 7
P4 covers bits 4, 5, 6, 7
If all parity bits are correct, then no error Sum positions of erroneous parity bits to find out location of actual data bit in error If only one parity bit indicates error, then error is in parity bit Hamming (8, 4) can also detect two bit errors (includes one overall parity bit)
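The construction can be written out directly. The sketch below assumes even parity and the bit layout p1 p2 d1 p4 d2 d3 d4 from the slide; the function names are hypothetical. The decoder recomputes the three parity checks and sums the positions of the failing ones to locate a single bit error.

```python
def hamming74_encode(data):
    """Encode 4 data bits into the 7-bit layout p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(code):
    """Return (data bits, error position); position 0 means no error seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4    # sum of positions of failing checks
    if position:
        c[position - 1] ^= 1           # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], position

codeword = hamming74_encode([1, 0, 1, 1])
received = codeword.copy()
received[5] ^= 1                       # corrupt bit number 6 (d3)
print(hamming74_decode(received))
```

The decoder corrects exactly one flipped bit per codeword; with two errors it would silently "correct" the wrong bit, which is why the extended (8, 4) variant adds an overall parity bit.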

Recovering from errors Error correcting codes on bit level useful, but have high overhead Hamming (7,4) only sends 4/7 = 57% useful information Typically communication channels not that unreliable What could be possible exceptions? Always sending redundant information wastes resources Focus on detection and subsequent recovery Detection usually needs less redundancy Recovery = Receiver asks sender to retransmit Key issues: How, when, and how much?

Recovery basics What are possible errors? Corrupted data Can be identified via checksums, etc. The stuff we have just seen Receiver asks sender to retransmit corrupted data Lost packet How can a receiver know to expect a packet? How can a receiver know to ask for something it doesn't know exists?

ARQ: Automatic Repeat Request Every packet must have a sequence number Sequence number must be unique across packets in-flight Sequence numbers can be re-used if no risk of confusion Receiver acknowledges reception of packet number X Sender knows packet was successfully received Sender sends next packet (Diagram: S sends P:123, R replies ACK:123, S sends P:124)

ARQ: Problems Lost packet Receiver cannot acknowledge How to solve this? Timeout Sender waits t seconds for ACK No ACK → Retransmit same packet Everything resumes as usual (Diagram: S sends P:123, which is lost before it reaches R; after the timeout S sends P:123 again and R replies ACK:123)

ARQ: More problems What if ACK is lost? What is the difference for the sender? No difference, timeout, retransmit How about for receiver? Receives same packet twice Must keep track of received packets Prune duplicates (Diagram: S sends P:123; R replies ACK:123, which is lost; S times out and sends P:123 again)

ARQ: More issues What if no ACK comes despite multiple retransmissions? Maximum number of retransmissions If no success, then connection is assumed to be lost Another (longer) timeout If this is triggered, connection is assumed to be lost How much bookkeeping for receiver for duplicate packets? Sliding window, i.e., define maximum number of outstanding packets Limits need for bookkeeping at receiver Also defines maximum number needed for sequence numbers

Types of ARQ ARQ typically exists both on link and transport layers Here, general properties of ARQ Later a practical case with TCP Stop-and-Wait Send one packet, wait for ACK Only then send new packet Simple to implement Very inefficient How to make it more efficient?
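The interplay of sequence numbers, timeouts, lost ACKs and duplicate pruning in Stop-and-Wait can be sketched as a small deterministic simulation. Everything here is an assumption made for illustration: the function name, the idea of modelling a timeout as simply "try again", and describing losses by the (1-based) number of the transmission attempt that is dropped.

```python
def stop_and_wait(packets, lost_data=(), lost_acks=(), max_retries=5):
    """Simulate Stop-and-Wait; return (delivered payloads, transmissions used)."""
    delivered = []      # what the receiver hands to the upper layer
    expected = 0        # receiver: next sequence number not yet seen
    attempt = 0         # counts every data transmission by the sender
    for seq, payload in enumerate(packets):
        for _ in range(max_retries + 1):
            attempt += 1
            if attempt in lost_data:
                continue                    # data lost: timeout, retransmit
            if seq == expected:             # new packet: deliver it
                delivered.append(payload)
                expected += 1
            # otherwise it is a duplicate: drop it, but still acknowledge
            if attempt in lost_acks:
                continue                    # ACK lost: timeout, retransmit
            break                           # ACK arrived: send next packet
        else:
            raise ConnectionError("too many retransmissions, giving up")
    return delivered, attempt

# Transmission 2 (packet "b") is lost, and the ACK for transmission 4 is lost
print(stop_and_wait(["a", "b", "c"], lost_data={2}, lost_acks={4}))
```

Each packet costs at least one round trip, which is the inefficiency that the window-based variants try to remove.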

Go-back-N ARQ Window for unacked packets Sender can send this many at once Receiver acks last received packet Sender sends next window of packets ACK can be for next expected packet For lost packets Receiver acks last consecutive packet Sender resumes from that point Retransmits all packets after missing one, even if they were correctly received (Diagram: S sends P:123, P:124, P:125; R replies ACK:125; S sends P:126, P:127, P:128; R replies ACK:126)

Go-back-N issues More efficient than Stop-and-Wait Can send one window worth of packets per RTT Stop-and-Wait has N = 1 Issue: Everything after lost packet is sent again Worse: ACK is lost Worst case: Whole window is sent twice Especially bad if window is big How to solve this problem?
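The retransmission cost of Go-back-N can be made visible with a round-based simulation. This is a simplified sketch under several assumptions: ACKs are never lost, the sender transmits one full window per round, and losses are given as 0-based indices into the sender's transmission log. The function name is hypothetical.

```python
def go_back_n(num_packets, window, lost=()):
    """Return the sequence numbers in the order the sender transmits them."""
    log = []            # every transmission, including retransmissions
    base = 0            # sender: oldest unacknowledged sequence number
    expected = 0        # receiver: only this sequence number is accepted
    while base < num_packets:
        for seq in range(base, min(base + window, num_packets)):
            tx = len(log)
            log.append(seq)
            if tx not in lost and seq == expected:
                expected += 1       # in-order packet accepted
            # anything out of order is discarded by the receiver
        base = expected             # cumulative ACK: resume after last in-order packet
    return log

# Packets 0..5, window of 3, second transmission (packet 1) is lost
print(go_back_n(6, 3, lost={1}))
```

In this run packet 2 is sent twice although its first copy arrived intact, because the receiver only keeps track of one number, the next expected packet.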

Selective Repeat ARQ Sender sends one window of packets No errors: Receiver acks them all Either individually or cumulative Must make sure this works Error: Receiver tells sender which packets were missing/wrong Either acks successes or nacks failures Sender only retransmits failed packets (Diagram: S sends P:123, P:124, P:125; R replies ACK:125; S sends P:126, P:127, P:128; R replies NACK:127; S resends P:127)
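A round-based sketch of Selective Repeat, under the same simplifying assumptions one might use for a Go-back-N simulation: feedback from the receiver is never lost, the sender works through one window per round, and losses are given as 0-based indices into the sender's transmission log. The function name is made up for this example.

```python
def selective_repeat(num_packets, window, lost=()):
    """Return the sequence numbers in the order the sender transmits them."""
    log = []            # every transmission, including retransmissions
    received = set()    # receiver buffers packets even when out of order
    base = 0            # sender: oldest sequence number not yet acknowledged
    while base < num_packets:
        for seq in range(base, min(base + window, num_packets)):
            if seq in received:
                continue            # already acknowledged, do not resend
            tx = len(log)
            log.append(seq)
            if tx not in lost:
                received.add(seq)
        while base in received:     # slide the window past acknowledged packets
            base += 1
    return log

# Packets 0..5, window of 3, second transmission (packet 1) is lost
print(selective_repeat(6, 3, lost={1}))
```

Only the lost packet is sent again, at the price of per-packet bookkeeping (`received`) on the receiver side.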

Comparison Go-back-N Pros: Easy to implement, not much bookkeeping needed (one number at receiver) Works if errors are rare enough Cons: More transmission overhead for errors Sender needs to keep track of all packets in window Selective Repeat Pros: Only retransmits data that didn't make it the first time Most efficient use of network resources Cons: Sender needs to keep track of all packets in window Receiver needs to keep track of all packets in window

ARQ summary Basic recovery mechanism Used both on link and transport layers Two main variants: Go-back-N and Selective Repeat TCP was originally Go-back-N Nowadays extensions for Selective Repeat (SACK) Later there will be a discussion on TCP Another variant: Hybrid ARQ Combines ARQ with Forward Error Correction We will see this later in the chapter

Network level reliability

Network level solutions Forward Error Correction (FEC) Basics Reed-Solomon codes Fountain codes Network coding Hybrid ARQ

FEC basics Add redundancy to sent data to allow receiver to recover from errors Error can be corrupted data or lost packet Is there a difference between these two? Two basic ways of error correction: Add redundancy to allow corrupted data to be recovered Add additional data to allow recovery of completely lost data Forward Error Correction typically means the second option When to use FEC? Common use case: Retransmission is impossible or too expensive

Types of FEC Block codes Fixed size blocks or packets Lost or corrupted data Hamming codes, Reed Solomon Convolutional codes Bit streams of arbitrary length Erasure codes Specifically against lost data Fountain codes

Block codes Block codes divide data into fixed size blocks (e.g., packets) Assume k bits in size Then add redundancy to produce n bits of output Rate of code: R = k/n Large R: Not much redundancy, opposite for small R Tradeoff between n and resulting overhead Lots of different block codes in existence

Simple example Our programming assignment has a simple block code Two inputs: Packets A and B Redundancy: C = A XOR B Rate: 2/3 Three packets A, B, and C form a group For receiver: Receive any 2 packets out of the group of A, B, C Reconstruct A and B (directly or XORing) No redundancy across different groups of 3 packets
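The XOR group code described above fits in a few lines. This is only a sketch and not the assignment's reference solution: the function names and the dictionary-based packet representation are invented here, and both packets are assumed to have the same length.

```python
def xor_bytes(x, y):
    """Byte-wise XOR of two equal-length packets."""
    return bytes(a ^ b for a, b in zip(x, y))

def encode_group(a, b):
    """Form the group A, B, C with C = A XOR B (rate 2/3)."""
    return {"A": a, "B": b, "C": xor_bytes(a, b)}

def decode_group(received):
    """Reconstruct (A, B) from any two packets of the group."""
    if "A" in received and "B" in received:
        return received["A"], received["B"]
    if "A" in received and "C" in received:
        return received["A"], xor_bytes(received["A"], received["C"])
    if "B" in received and "C" in received:
        return xor_bytes(received["B"], received["C"]), received["B"]
    raise ValueError("need at least two packets of the group")

group = encode_group(b"hello", b"world")
without_a = {k: v for k, v in group.items() if k != "A"}   # packet A was lost
print(decode_group(without_a))
```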

Reed Solomon codes Defined by Reed and Solomon in 1960 Where are they used? CD, DVD, QR codes, DVB, DSL, space communications, ... Pretty much everywhere Operates on multi-bit symbols (read: group of bits) Burst errors affect multiple bits But hopefully only one symbol Good error correction properties

Tornado codes Tornado codes are like Reed Solomon codes Less efficient on space More efficient on speed Tornado codes based on layered approach All layers (but one) use Low-density Parity Check (LDPC) code Efficient, but can fail Last layer uses Reed Solomon Slower but optimal Many other similar codes exist

Erasure codes Goal: Recover lost data Reed Solomon codes are one example of this category Polynomial interpolation Basic idea: Message of k packets (also called symbols) Encoded into n packets Receiver can reconstruct from any k' packets received (out of n) Rate: k/n Reception efficiency: k'/k Ideally k' = k; any received k symbols sufficient

Fountain codes Rateless erasure codes Unlimited number of encoded packets n source packets Need any n (or close to n) encoded packets to decode original data (Figure: example with n = 3; sender S emits a stream of encoded packets A to I towards receivers R1, R2, R3)

Raptor codes Raptor = Rapid Tornado First fountain codes with linear-time encoding and decoding Original message k symbols Receive any k encoded symbols → High probability of decoding For k received symbols, less than 1% chance of error For k+2 received symbols, less than 1 in million chance of error Symbol can be of any size (byte, packet, ...)

Use of fountain codes Useful for broadcast content Same content being broadcast to multiple recipients Especially when receivers can join at any time Also known as data carousel RFC 5053 has been widely adopted Used by 3GPP DVB for handheld devices DVB-IPTV, TV over IP networks Updated RaptorQ in RFC 6330

Usefulness of fountain codes Low overhead Close to ideal Receiver can act independently No need to contact sender for recovery Not enough packets → Receive some more Works best for broadcast or multicast No need to know identities of receivers Works for unicast as well Efficiency depends on channel and many other factors

Network coding Not so much for reliability as for improved performance Improved scalability and throughput Basic idea: Combine multiple packets together Theoretical property: Can achieve maximum throughput for single-source multicast No proof for multi-source cases, though

Butterfly network (Figure: butterfly topology in which packets A and B travel from the source towards two receivers and share one central link)

Network coding in practice Routing of packets only: Central link can send either A or B, not both Network coding makes a combination of A and B Send combination over bottleneck link Receivers get A or B separately, can decode other Sender has N packets to send Create linear combinations of packets with random coefficients Coefficients chosen from a Galois field If received packets are linearly independent, decoding successful If not, unlikely to be able to decode anything Solution: Continue to send more

Network coding example Three original packets A, B, and C Select coefficients to create 3 encoded packets D, E, F: D = xA + yB + zC, E = kA + lB + mC, F = nA + oB + pC If this set of linear equations has a unique solution, then code works We can create further packets G, H, I, with different coefficients Receiver needs enough packets to solve A, B, and C At least 3, could be more depending on the coefficients
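The encoding and decoding steps can be tried out with a small sketch. Several simplifications are assumed here: arithmetic is done modulo the prime 257 (a prime field is one kind of Galois field, whereas practical systems typically use GF(2^8) so that symbols stay one byte), packets are short lists of integers, and the coefficients are fixed rather than random so that the run is reproducible. The function names are hypothetical.

```python
P = 257   # assumed field size; arithmetic modulo a prime forms a Galois field

def encode(packets, coeffs):
    """One coded packet: symbol-wise sum of coeffs[i] * packets[i] mod P."""
    return [sum(c * p[j] for c, p in zip(coeffs, packets)) % P
            for j in range(len(packets[0]))]

def decode(coeff_rows, coded):
    """Gauss-Jordan elimination over GF(P); None if rows are not independent."""
    n = len(coeff_rows[0])
    rows = [[v % P for v in list(c) + list(y)] for c, y in zip(coeff_rows, coded)]
    for col in range(n):
        pivot = next((r for r in range(col, len(rows)) if rows[r][col]), None)
        if pivot is None:
            return None                           # not enough independent packets
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], P - 2, P)       # inverse via Fermat's little theorem
        rows[col] = [v * inv % P for v in rows[col]]
        for r in range(len(rows)):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(v - f * w) % P for v, w in zip(rows[r], rows[col])]
    return [rows[i][n:] for i in range(n)]

A, B, C = [1, 2, 3], [4, 5, 6], [7, 8, 9]
good = [(1, 1, 1), (1, 2, 3), (1, 4, 9)]          # linearly independent
bad = [(1, 1, 1), (2, 2, 2), (1, 2, 3)]           # second row = 2 * first row

print(decode(good, [encode([A, B, C], c) for c in good]))
print(decode(bad, [encode([A, B, C], c) for c in bad]))

# Receiving one more packet with new coefficients makes decoding possible
more = bad + [(0, 0, 1)]
print(decode(more, [encode([A, B, C], c) for c in more]))
```

The second call illustrates the case from the slide where three received packets are not enough because their coefficient vectors are linearly dependent.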

Network coding article C. Gkantsidis, P. Rodriguez, Network Coding for Large Scale Content Distribution, IEEE Infocom 2005 Next article essay to be completed Article discusses how to use network coding in large scale content delivery systems Similar to BitTorrent Was planned to be used for software updates See Moodle and announcements later about deadline and link

Combining different mechanisms How about combining ARQ and FEC? Let's try a layered approach Put one as layer on top of the other Both working independently What could go wrong?

FEC on top of ARQ First layer is FEC Add FEC to packets from application Pass them to ARQ which tries to get them through What is the problem? ARQ tries to get all packets through But whole idea of FEC is to allow for loss Not much benefit

ARQ on top of FEC How about the other way around? ARQ gets packets from application Lower layer uses FEC to ensure delivery What is the problem? FEC may add delay → Possible timeouts Longer timeouts → Slower recovery → Lower throughput Unnecessary retransmissions (with FEC) Neither solution is without problems

Interactions between mechanisms Different reliability mechanisms may interact Like previous example Not always a good idea to enable everything Must be aware of (subtle) effects in chosen design Usually no clear optimal solution Can make new combinations of solutions These kinds of interactions can happen between any solutions

Hybrid ARQ A different (smarter?) way of combining ARQ and FEC Use both at the same time Encode packets with FEC for error correction Use ARQ and its error detection as a fallback Pros: Works well over poor quality channels Cons: Adds significant overhead for good quality channels How to tweak?

Hybrid ARQ Adjust amount of FEC based on observed channel quality At first, only use ARQ If everything goes smoothly, remain with ARQ If there are errors, start including FEC (below ARQ) Adjust amount of FEC based on need Soft combining: Receiver keeps incorrectly received packets Attempts to combine with future packets Used for example in HSDPA

Summary Link level and network level mechanisms Error detection: Parities, checksums, CRC Error correction: Hamming codes Recovery: ARQ Forward Error Correction Network coding Hybrid ARQ