Receive Livelock
Robert Grimm, New York University
The Three Questions
- What is the problem?
- What is new or different?
- What are the contributions and limitations?
Motivation
- Interrupts work well when I/O events are rare
  - Think disk I/O
- By comparison, polling is expensive
  - After all, the CPU doesn't really do anything useful when polling
  - To achieve the same latency as interrupts, it needs to poll thousands of times per second
- But the world has changed: it's all about networking
  - Multimedia, host-based routing, network monitoring, NFS, multicast, and broadcast all lead to higher interrupt rates
  - Once the interrupt rate is too high, the system becomes overloaded and eventually makes no progress
Avoiding Receive Livelock
- Hybrid design (see the sketch below)
  - Poll when triggered by interrupt
  - Interrupt only when polling is suspended
- Result
  - Low latency under low loads
  - High throughput under high loads
- Additional techniques
  - Drop packets early (i.e., those with the least investment)
  - Connect with the scheduler (i.e., give resources to user tasks)
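A minimal C sketch of the hybrid scheme, assuming hypothetical driver hooks (disable_rx_interrupts, device_has_packet, and so on); none of these names come from the paper's actual kernel:

    #include <stdbool.h>

    /* Hypothetical driver hooks; a real kernel supplies its own. */
    extern void disable_rx_interrupts(void);
    extern void enable_rx_interrupts(void);
    extern void wakeup_polling_thread(void);
    extern void wait_for_wakeup(void);
    extern bool device_has_packet(void);
    extern void process_one_packet(void);

    static volatile bool service_needed;

    /* Interrupt handler: does almost nothing, then silences itself. */
    void rx_interrupt(void)
    {
        disable_rx_interrupts();      /* no more interrupts while we poll */
        service_needed = true;
        wakeup_polling_thread();
    }

    /* Polling thread: drains the device, then re-arms the interrupt. */
    void polling_thread(void)
    {
        for (;;) {
            wait_for_wakeup();
            service_needed = false;
            while (device_has_packet())
                process_one_packet();
            enable_rx_interrupts();   /* quiescent: back to interrupt mode */
        }
    }

Under low load the device is usually quiescent, so each packet is handled via a fresh interrupt (low latency); under high load the loop keeps polling and interrupts stay off (high throughput).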
Requirements
- Acceptable throughput
  - Keep up with the Maximum Loss Free Receive Rate (MLFRR)
  - Keep transmitting as you are receiving
- Reasonable latency, low jitter
  - Avoid long queues
- Fair allocation of resources
  - Continue to service management and control tasks
- Overall system stability
  - Do not impact other systems on the network
  - Livelock may look like link failure and lead to more control traffic
Interrupts: Packet Arrival
- Packet arrival signaled through an interrupt
  - Associated with a fixed interrupt priority level (IPL)
  - Handled by the device driver
  - Packet placed into a queue, dropped if the queue is full
- Protocol processing initiated by a software interrupt
  - Associated with a lower IPL
  - May be batched: process several packets before returning
- Gives absolute priority to incoming packets
  - But modern systems have large network card buffers and DMA
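For concreteness, a sketch of this classic BSD-style path (ipintrq and the software-interrupt scheduling follow BSD convention; the helper names here are simplified stand-ins, not the real macros):

    #include <stddef.h>

    struct mbuf;                              /* BSD packet buffer */
    extern struct mbuf *copy_packet_from_device(void);
    extern void m_freem(struct mbuf *);
    extern int  ipintrq_full(void);
    extern void ipintrq_enqueue(struct mbuf *);
    extern void schednetisr_ip(void);         /* schedule IP software interrupt */

    /* Runs at the device's (high) IPL for every arriving packet. */
    void rx_interrupt(void)
    {
        struct mbuf *m = copy_packet_from_device();  /* work already done... */
        if (m == NULL)
            return;
        if (ipintrq_full())
            m_freem(m);               /* ...even when the packet is dropped */
        else {
            ipintrq_enqueue(m);
            schednetisr_ip();         /* IP processing later, at lower IPL */
        }
    }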
Interrupts: Receive Livelock
- If packets arrive too fast, the system spends most of its time servicing packet-received interrupts
  - After all, they have absolute priority
  - No resources left to deliver packets to applications
- After reaching the MLFRR, throughput begins to fall again
  - Eventually reaches 0 (!)
- But doesn't batching help?
  - Can increase the MLFRR
  - But cannot, by itself, avoid livelock
Interrupts: Overload Impact
- Packet delivery latency increases
  - Packets arriving in bursts are processed in bursts
  - Link-level processing: copy into kernel buffer and queue
  - Dispatch: queue for delivery to user process
  - Delivery: schedule user process
- Transmits may starve
  - Transmission is usually performed at a lower IPL than reception
    - Why do we need interrupts for transmission? Don't we just write the data to the interface and say "transmit"?
  - But the system is busy servicing packet arrivals
Better Scheduling
- Limit the interrupt arrival rate to prevent saturation
  - If the internal queue is full, disable receive interrupts (see the sketch below)
    - For the entire system?
  - Re-enable interrupts once buffer space becomes available or after a timeout
  - Track time spent in the interrupt handler
    - If larger than a specified fraction of total time, disable interrupts
  - Alternatively, sample the CPU state on clock interrupts
    - When to use this alternative? Why does it work?
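One hypothetical rendering of the queue-state feedback; the timeout guards against the queue never draining below the re-enable point (all names are illustrative):

    extern int  queue_full(void);
    extern int  queue_has_space(void);
    extern void disable_rx_interrupts(void);
    extern void enable_rx_interrupts(void);
    extern void arm_reenable_timeout(void);   /* later calls maybe_unthrottle() */

    static int rx_disabled;

    /* Called from the receive path after enqueueing a packet. */
    void maybe_throttle(void)
    {
        if (queue_full()) {
            disable_rx_interrupts();
            rx_disabled = 1;
            arm_reenable_timeout();
        }
    }

    /* Called as the protocol layer drains the queue, and on timeout. */
    void maybe_unthrottle(void)
    {
        if (rx_disabled && queue_has_space()) {
            rx_disabled = 0;
            enable_rx_interrupts();
        }
    }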
Better Scheduling (cont.)
- Use polling to provide fairness
  - Query all sources of packet events round-robin (see the sketch below)
- Integrate with interrupts
  - Reflects the duality of approaches: polling for predictable events, interrupts for unpredictable events
- Avoid preemption to ensure progress
  - Either do most work at high IPL
  - Or do hardly any work at high IPL
    - Integrates better with the rest of the kernel
    - Sets a service-needed flag and schedules the polling thread
  - Gets rid of what?
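A sketch of the round-robin polling loop over per-source service-needed flags, under the same illustrative-names caveat:

    #include <stdbool.h>

    struct source {
        volatile bool service_needed;        /* set by that source's interrupt */
        int (*poll)(int quota);              /* process up to quota packets */
    };

    extern void sleep_until_work(void);

    /* Round-robin gives each source a bounded share of work per pass. */
    void polling_loop(struct source *src, int nsrc, int quota)
    {
        for (;;) {
            bool any = false;
            for (int i = 0; i < nsrc; i++) {
                if (!src[i].service_needed)
                    continue;
                any = true;
                if (src[i].poll(quota) < quota)
                    src[i].service_needed = false;  /* drained; re-arm IRQ */
            }
            if (!any)
                sleep_until_work();          /* idle: interrupts take over */
        }
    }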
Experimental Setup
- IP packet router built on Digital Unix (DEC OSF/1)
  - Bridges between two (otherwise unloaded) Ethernets
  - Runs on a DECstation 3000/300 running Digital Unix 3.2
    - Slowest Alpha-based host available (around 1996)
- Load generator sends 10,000 UDP packets
  - 4 bytes of data per packet
Unmodified Kernel
[Figure: output packet rate vs. input packet rate (pkts/sec), with and without screend]
- With screend, peak at 2,000 pkts/sec, livelock at 6,000 pkts/sec
- Without, peak at 4,700 pkts/sec, livelock at 14,880 pkts/sec
Unmodified Kernel in Detail
[Diagram: receive interrupt handler → ipintrq → IP forwarding layer → output ifqueue → transmit interrupt handler, arranged by increasing interrupt priority level]
- Packets are only discarded after considerable processing
  - I.e., copying (into a kernel buffer) and queueing (into ipintrq)
Modified Kernel
[Diagram: modified path adds polled received-packet and transmit-packet processing alongside the unmodified receive/transmit interrupt handlers, ipintrq, IP forwarding layer, and output ifqueues, arranged by increasing interrupt priority level]
- Where are packets dropped and how? (see the sketch below)
- Why retain the transmit queue?
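One plausible shape for the polled receive path with a packet quota; packets that are never dequeued overflow the device's receive ring and are dropped there, before any host investment (names are illustrative):

    struct mbuf;
    extern int  ring_has_packet(void);          /* device receive ring */
    extern struct mbuf *ring_dequeue(void);
    extern void ip_forward_to_completion(struct mbuf *);

    /* Process at most `quota` packets per call, each one to completion.
     * Under overload the ring overflows and the device drops packets
     * at zero host cost; no half-processed work is discarded. */
    int rx_poll(int quota)
    {
        int done = 0;
        while (done < quota && ring_has_packet()) {
            ip_forward_to_completion(ring_dequeue());
            done++;
        }
        return done;    /* done < quota means the ring is drained */
    }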
Performance w/o screend
[Figure: output vs. input packet rate (pkts/sec) for polling (no quota), polling (quota = 5), no polling, and the unmodified kernel]
- Why do we need quotas?
- Why is the livelock worse for the modified kernel?
Performance w/ screend
[Figure: output vs. input packet rate (pkts/sec) for polling with feedback, polling without feedback, and the unmodified kernel]
- Why is polling not enough?
- What additional change is made?
Packet Count Quotas
[Figures: output vs. input packet rate for quotas of infinity, 100, 20, 10, and 5 packets, both without and with screend; plus peak output rate (for any input rate) and asymptotic output rate (for the peak input rate) as a function of the polling quota]
- What causes the difference?
What About Other Tasks?
- So far, they don't get any cycles. Why?
- Solution: track cycles spent in the polling thread (see the sketch below)
  - Disable input handling if over a threshold
[Figure: available CPU time (per cent) vs. input packet rate, for thresholds of 25%, 50%, 75%, and 100%]
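A sketch of that feedback, assuming a periodic clock tick and hypothetical cycle counters:

    #include <stdint.h>

    extern uint64_t polling_cycles_this_tick(void);  /* cycles in poll thread */
    extern uint64_t total_cycles_this_tick(void);
    extern void disable_packet_input(void);
    extern void enable_packet_input(void);

    static unsigned input_threshold_pct = 75;        /* illustrative default */

    /* Invoked from the periodic clock interrupt. */
    void input_feedback(void)
    {
        uint64_t used  = polling_cycles_this_tick();
        uint64_t total = total_cycles_this_tick();

        if (total != 0 && used * 100 > (uint64_t)input_threshold_pct * total)
            disable_packet_input();   /* give the CPU back to user tasks */
        else
            enable_packet_input();
    }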
Diggin' Real Deep
[Figure: kernel stack level over time (usec), tracing input and ether_output activity and a timer interrupt, with polling enabled vs. disabled]
- What's wrong with this picture?
- How might you fix the problem, w/ what trade-off?
Network Monitoring
[Figure: tcpdump capture rate vs. input rate (packets per second), to /dev/null and to a disk file, with and without feedback, against the hypothetical loss-free rate]
- What is the difference from the previous application?
- Where are the MLFRR and the saturation point?
What Do You Think?