Zero-Copy Socket Splicing Alexander Bluhm bluhm@openbsd.org Sunday, 29. September 2013
Agenda
1 Motivation
2 Kernel MBuf
3 Packet Processing
4 Socket Splicing
5 Interface
6 Implementation
7 Applications
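Application Level Gateway
[Figure: the network stack (Physical, Data Link, Network IP, TCP/UDP, Application) divided into kernel and user land. Three ways to pass traffic through the gateway are marked at their layers: Relay at the application in user land, Socket Splicing at TCP/UDP in the kernel, and Packet Filter at IP in the kernel.]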
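Persistent HTTP Filtering
[Figure: a persistent HTTP connection carrying a sequence of messages, each a header followed by a body whose size is given by the content length. The relay filters every header and copies every body.]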
HTTP Socket Splicing
[Figure: a sequence of HTTP messages on one connection. Each header is filtered in user land; each body is spliced in the kernel for the given splice length.]
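
To splice only the body of a message, the relay first needs the body size from the header it has just filtered. The following is a minimal sketch of that step, not code taken from relayd: `http_content_length()` is a hypothetical helper, and it assumes the complete header block is already in memory as a NUL-terminated string. Chunked bodies and overflow checks are left out.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/*
 * Hypothetical helper: scan an HTTP header block for Content-Length and
 * return its value, or -1 if the field is absent or malformed.
 * The result could then serve as the maximum splice length for the body.
 */
long long
http_content_length(const char *header)
{
	static const char name[] = "Content-Length:";
	const char *line = header;

	while (line != NULL && *line != '\0') {
		if (strncasecmp(line, name, sizeof(name) - 1) == 0) {
			const char *p = line + sizeof(name) - 1;
			char *end;
			long long len;

			while (*p == ' ' || *p == '\t')
				p++;
			if (!isdigit((unsigned char)*p))
				return -1;
			len = strtoll(p, &end, 10);
			if (*end != '\r' && *end != '\n' && *end != '\0')
				return -1;
			return len;
		}
		/* advance to the start of the next header line */
		line = strchr(line, '\n');
		if (line != NULL)
			line++;
	}
	return -1;
}
```

A relay would call this once per message and fall back to copying in user land when it returns -1.
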
MBuf Data
[Figure: one mbuf of size 256. m_hdr contains m_data and m_len (42); m_data points into m_dat (size 236), which holds the ether, ip, and udp headers (size 42).]
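MBuf Data Chaining
[Figure: a packet made of two chained mbufs, each of size 256. The first has m_hdr (m_next, m_data, m_len = 42), m_pkthdr (len = 142), and m_pktdat (size 196) holding the ether, ip, and udp headers (size 42). Its m_next points to the second mbuf, which has m_next = NULL, m_len = 100, and m_dat (size 236) holding the payload (size 100).]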
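
The relation between the per-mbuf m_len values and the packet header length can be sketched in user space. This is a simplified model, not the kernel's struct mbuf: `struct toy_mbuf` and `toy_chain_length()` are made-up names, and only the three fields needed for the walk are kept.

```c
#include <stddef.h>

/* Simplified user-space model; the real struct mbuf has more fields. */
struct toy_mbuf {
	struct toy_mbuf	*m_next;	/* next mbuf of the same packet */
	char		*m_data;	/* start of valid data */
	unsigned int	 m_len;		/* bytes of valid data in this mbuf */
};

/* Sum m_len over the chain; for a whole packet this matches m_pkthdr len. */
unsigned int
toy_chain_length(const struct toy_mbuf *m)
{
	unsigned int len = 0;

	for (; m != NULL; m = m->m_next)
		len += m->m_len;
	return len;
}
```

For the chain on this slide, 42 bytes of headers plus 100 bytes of payload give the packet length of 142.
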
MBuf Packet Chaining
[Figure: five mbufs, each with m_hdr, m_next, and m_nextpkt, forming two packets. m_next links the mbufs of one packet, m_nextpkt links the first mbufs of successive packets, and only those first mbufs carry m_pkthdr.]
MBuf Cluster
[Figure: an mbuf of size 256 with m_hdr (m_data, m_len = 1400), m_pkthdr, and m_ext (ext_buf, ext_size = 2048). ext_buf points to a cluster of size 2048 that holds the ether, ip, and udp headers and the payload (size 1400).]
MBuf Cluster Copy
[Figure: two mbufs whose m_data and ext_buf point to the same cluster, which holds the ether, ip, and udp headers and the payload.]
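
The idea of two mbufs sharing one cluster can be illustrated with a small user-space model. Everything here is hypothetical: the `toy_` names are invented, and tracking the sharing with a reference counter is an assumption made for the sketch, since the slide does not say how the kernel keeps track of shared clusters.

```c
#include <stddef.h>

#define TOY_CLBYTES 2048	/* cluster size shown on the MBuf Cluster slide */

struct toy_cluster {
	char	buf[TOY_CLBYTES];
	int	refcnt;		/* assumption: shared storage is counted */
};

struct toy_ext_mbuf {
	char			*m_data;	/* points into ext->buf */
	unsigned int		 m_len;
	struct toy_cluster	*ext;		/* plays the role of ext_buf */
};

/* Make dst refer to the same cluster as src; no payload byte is copied. */
void
toy_cluster_copy(struct toy_ext_mbuf *dst, const struct toy_ext_mbuf *src)
{
	*dst = *src;
	dst->ext->refcnt++;
}
```

Copying the small mbuf header instead of up to 2048 bytes of data is what makes this kind of copy cheap.
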
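Packet Input
[Figure: receive path from the network up to user land.]
network driver interrupt handler -> ether_input() -> ip interface receive queue (m_nextpkt) -> ip_input() -> inetsw[] internet protocol switch -> tcp_input() -> socket receive buffer (m_next) -> soreceive() -> read()
Everything up to soreceive() runs in the kernel; read() is called from user land.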
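Packet Output
[Figure: send path from user land down to the network.]
write() -> sosend() -> socket send buffer (m_next) -> tcp_output() -> ip_output() -> ether_output() -> interface send queue (m_nextpkt) -> if_start() -> network driver start routine
write() is called from user land; everything from sosend() on runs in the kernel.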
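Data Copy
[Figure: data flow through a user-land relay, which copies the data out of the kernel and back in.]
Receive: tcp_input() -> so_rcv -> soreceive() -> uiomove() -> copyout() -> read()
Send: write() -> copyin() -> uiomove() -> sosend() -> so_snd -> tcp_output()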
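
For comparison with splicing, the user-land side of this copy path can be sketched as a plain read/write loop. This is a minimal blocking version written for illustration, not the relay.c test program; `relay_copy()` is a made-up name, and a real relay would use non-blocking descriptors with select(2) or a similar mechanism.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Copy everything from src to dst until EOF on src.
 * Returns the number of bytes relayed, or -1 on error.
 */
long long
relay_copy(int src, int dst)
{
	char buf[4096];
	long long total = 0;

	for (;;) {
		/* kernel to user land */
		ssize_t n = read(src, buf, sizeof(buf));

		if (n == 0)
			return total;
		if (n == -1) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		for (ssize_t off = 0; off < n; ) {
			/* user land back to kernel */
			ssize_t w = write(dst, buf + off, (size_t)(n - off));

			if (w == -1) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		total += n;
	}
}
```

Every byte crosses the user/kernel boundary twice, and each chunk costs two system calls.
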
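Process Wakeup
[Figure: the relay process waits in select() on its file descriptors, each backed by a struct socket. tcp_input() puts data into so_rcv and calls sorwakeup(), after which the process can read() through soreceive(). When an incoming ACK is handled, tcp_input() calls sowwakeup() for so_snd, after which the process can write() through sosend(), followed by tcp_output().]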
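Socket Splicing
[Figure: setsockopt(SO_SPLICE) calls sosplice() to link the two sockets. Afterwards tcp_input() fills so_rcv of the source and calls sorwakeup(); an ACK handled by tcp_input() on the drain calls sowwakeup(). Both wakeups lead to somove(), which moves the data from so_rcv to so_snd, followed by tcp_output().]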
UDP Sockets
[Figure: for UDP, udp_input() fills so_rcv and somove() hands the data on towards udp_output(); soreceive() and sosend() mark the user-land path.]
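Layer
[Figure: three levels at which data can cross from the input path to the output path.]
Relaying: read() -> write(), above soreceive() and sosend()
Socket Splicing: so_rcv -> so_snd, above tcp_input() and tcp_output()
Forwarding: ip_input() -> ip_output(), between ipintrq and if_snd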
Simple API
- Begin splicing from source to drain: setsockopt(source_fd, SO_SPLICE, drain_fd)
- Stop splicing: setsockopt(source_fd, SO_SPLICE, -1)
- Get spliced data length: getsockopt(source_fd, SO_SPLICE, &length)
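
The calls on this slide are abbreviated; a real setsockopt(2) call also needs the option level and the option length. The sketch below spells them out under the assumption that the level is SOL_SOCKET, the drain descriptor is passed as an int, and the length is returned as an off_t. The wrapper names are made up, and because SO_SPLICE exists only on OpenBSD, the code is guarded so that it still compiles elsewhere and fails with ENOPROTOOPT.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>

/* Begin splicing from source_fd to drain_fd; returns 0, or -1 with errno. */
int
splice_begin(int source_fd, int drain_fd)
{
#ifdef SO_SPLICE
	return setsockopt(source_fd, SOL_SOCKET, SO_SPLICE,
	    &drain_fd, sizeof(drain_fd));
#else
	(void)source_fd;
	(void)drain_fd;
	errno = ENOPROTOOPT;	/* SO_SPLICE is not available here */
	return -1;
#endif
}

/* Stop splicing by passing -1 as the drain, as on the slide. */
int
splice_stop(int source_fd)
{
	return splice_begin(source_fd, -1);
}

/* Store the number of bytes spliced so far in *length. */
int
splice_length(int source_fd, off_t *length)
{
#ifdef SO_SPLICE
	socklen_t optlen = sizeof(*length);

	return getsockopt(source_fd, SOL_SOCKET, SO_SPLICE, length, &optlen);
#else
	(void)source_fd;
	(void)length;
	errno = ENOPROTOOPT;
	return -1;
#endif
}
```

Both descriptors are expected to be connected sockets of the same protocol before `splice_begin()` is called.
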
Extended API
struct splice {
        int             sp_fd;   /* drain */
        off_t           sp_max;  /* maximum */
        struct timeval  sp_idle; /* timeout */
};
setsockopt(source_fd, SO_SPLICE, &splice)
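
A possible way to fill in struct splice is sketched below. The wrapper `splice_limited()` is a made-up name, the option level SOL_SOCKET is an assumption, and so is the convention that a zero sp_max or sp_idle means no limit and no timeout. The guard keeps the code compilable on systems without SO_SPLICE, where it fails with ENOPROTOOPT.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>
#include <string.h>

/*
 * Splice source_fd to drain_fd for at most max bytes, giving up after
 * idle_sec seconds without traffic. Returns 0, or -1 with errno set.
 */
int
splice_limited(int source_fd, int drain_fd, off_t max, time_t idle_sec)
{
#ifdef SO_SPLICE
	struct splice sp;

	memset(&sp, 0, sizeof(sp));
	sp.sp_fd = drain_fd;		/* drain */
	sp.sp_max = max;		/* maximum; assumed 0 = unlimited */
	sp.sp_idle.tv_sec = idle_sec;	/* timeout; assumed 0 = none */
	return setsockopt(source_fd, SOL_SOCKET, SO_SPLICE, &sp, sizeof(sp));
#else
	(void)source_fd;
	(void)drain_fd;
	(void)max;
	(void)idle_sec;
	errno = ENOPROTOOPT;	/* SO_SPLICE is not available here */
	return -1;
#endif
}
```

For an HTTP body, max would be the content length taken from the filtered header.
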
Properties
- Splicing is unidirectional
- Invoke it twice for bidirectional splicing
- Process can turn it on and off
- Works for TCP and UDP
- Can mix IPv4 and IPv6 sockets
Unsplice
- Dissolve socket splicing manually
- read(2) or select(2) from the source
- EOF: source socket shutdown
- EPIPE: drain socket error
- EFBIG: maximum data length
- ETIMEDOUT: idle timeout
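
Once the source becomes readable again, a relay has to find out why splicing ended. The mapping from the slide can be written as a small lookup. `unsplice_reason()` is a hypothetical helper; how the error value is obtained is not shown on the slide, and fetching it with getsockopt(2) and SO_ERROR is only an assumption.

```c
#include <errno.h>

/* Translate the error seen on the source socket into the slide's wording. */
const char *
unsplice_reason(int error)
{
	switch (error) {
	case 0:
		return "source socket shutdown (EOF) or manual unsplice";
	case EPIPE:
		return "drain socket error";
	case EFBIG:
		return "maximum data length reached";
	case ETIMEDOUT:
		return "idle timeout";
	default:
		return "other socket error";
	}
}
```

In the HTTP case, EFBIG is the expected outcome: the body has been spliced completely and the relay takes over again for the next header.
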
Struct Socket
struct socket {
        ...
        struct socket  *so_splice;
        struct socket  *so_spliceback;
        off_t           so_splicelen;
        off_t           so_splicemax;
        struct timeval  so_idletv;
        struct timeout  so_idleto;
        ...
};
sosplice(9)
- Protocol must match
- Sockets must be connected
- Double link sockets
- Move existing data
somove(9)
- Check for errors
- Check for space
- Handle maximum
- Handle out of band data
- Move socket buffer data
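
The space check and the maximum handling can be modelled in user space. This is a toy, not the kernel function: the `toy_` names are invented, flat byte arrays stand in for the mbuf chains of so_rcv and so_snd, bytes are copied where the kernel would move mbufs, and error checks and out of band data are left out. A max of 0 is assumed to mean no limit.

```c
#include <stddef.h>
#include <string.h>

struct toy_sockbuf {
	char	data[256];
	size_t	len;		/* bytes currently buffered */
};

/*
 * Move as much as possible from src to dst.
 * *spliced counts the bytes moved so far and never exceeds a nonzero max.
 * Returns 1 when the maximum has been reached, otherwise 0.
 */
int
toy_somove(struct toy_sockbuf *src, struct toy_sockbuf *dst,
    size_t max, size_t *spliced)
{
	size_t n = src->len;
	size_t space = sizeof(dst->data) - dst->len;

	if (n > space)				/* check for space */
		n = space;
	if (max != 0 && n > max - *spliced)	/* handle maximum */
		n = max - *spliced;

	/* move socket buffer data and keep the remainder in the source */
	memcpy(dst->data + dst->len, src->data, n);
	dst->len += n;
	memmove(src->data, src->data + n, src->len - n);
	src->len -= n;
	*spliced += n;

	return max != 0 && *spliced == max;
}
```

Data beyond the maximum stays in the source buffer, which is what lets a relay read the next HTTP header itself after the body has been spliced.
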
sounsplice()
- Manual unsplice
- Cannot receive
- Cannot send
- Maximum
- Timeout
- Socket closed
sorwakeup(), sowwakeup()
- Called from tcp_input()
- Source calls sorwakeup()
- Drain calls sowwakeup()
- Both invoke somove(9)
Relayd
- Plain TCP connections
- HTTP connections
- Filter persistent HTTP
- HTTP Chunking
Tests
- /usr/src/regress/sys/kern/sosplice/
- 15 API tests
- 18 UDP tests
- 76 TCP tests
- perf/relay.c simple example
- BSD::Socket::Splice Perl API
- 28 relayd tests
Performance
- Factor 1 or 2 for TCP
- Factor 6 or 8 for UDP
Documentation
- Manpage setsockopt(2) SO_SPLICE
- Manpage sosplice(9) somove(9)
Questions?