Networking Subsystem in Linux
Manoj Naik
IBM Almaden Research Center
Scope of the talk
  - Linux TCP/IP networking layers
  - Socket interfaces and structures
  - Creating and using INET sockets
  - Linux IP layer
  - Socket calls in kernel
  - Zerocopy networking
Linux TCP/IP Networking Layers
  [Figure: layer diagram, top to bottom]
  - User space: Network Applications
  - Kernel, Socket Interface: BSD Sockets, INET Sockets
  - Kernel, Protocol Layers: TCP, UDP, IP, ARP
  - Kernel, Network Devices: PPP, SLIP, Ethernet
BSD Socket Interface
  - Inter-process communication mechanism
  - Address families
      - UNIX, INET, X25...
  - Socket types
      - Stream, Datagram, Raw, Packet
  - Processes in client-server model
      - IP address, ports
Linux Socket Structures
  - protocols vector
      - address families and protocols
      - name (e.g. INET), init routine
  - proto_ops data structures
      - registered protocol operations
      - e.g. sendmsg, recvmsg
INET Socket Layer
  [Figure: chain of data structures from a process's open files to the INET sock]
  - files_struct: count, close_on_exec, open_fs, fd[0], fd[1], ..., fd[255]
  - file: f_mode, f_pos, f_flags, f_count, f_owner, f_op, f_inode, f_version
  - inode
  - socket: type (SOCK_STREAM), ops (Address Family socket operations), data
  - sock: type (SOCK_STREAM), protocol, socket
Using BSD Sockets in Linux
  - Creating a BSD socket
  - Binding an address to an INET socket
  - Making a connection on an INET socket
  - Listening on an INET socket
  - Accepting connection requests
Creating a BSD Socket
  - int socket(family, type, protocol)
  - Search pops for matching address family
      - sock->proto_ops = pops[family]
  - Allocate a new socket data structure
      - VFS inode
  - Call family-specific create routine
  - Create and initialize new file structure
      - file->f_op = sock_ops
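The lookup step in socket creation can be pictured as a table of per-family operations indexed by address family. The sketch below is a user-space toy, not kernel code: the names `toy_proto_ops`, `toy_pops`, `toy_register` and `toy_lookup`, the table size, and the family number 2 are all assumptions made for illustration.

```c
#include <stddef.h>

#define TOY_NPROTO 16   /* hypothetical size of the address family table */

/* Hypothetical stand-in for a proto_ops structure: one per address family. */
struct toy_proto_ops {
    int family;                                    /* index into the table */
    const char *name;                              /* e.g. "INET" */
    int (*create)(int type, int protocol);         /* family-specific create routine */
    int (*sendmsg)(const void *buf, size_t len);   /* one registered operation */
};

/* The "protocols vector": NULL means no family registered in that slot. */
static const struct toy_proto_ops *toy_pops[TOY_NPROTO];

/* Register a family; returns 0 on success, -1 if out of range or already taken. */
static int toy_register(const struct toy_proto_ops *ops)
{
    if (ops->family < 0 || ops->family >= TOY_NPROTO || toy_pops[ops->family])
        return -1;
    toy_pops[ops->family] = ops;
    return 0;
}

/* Find the operations for a family, as socket() would before calling create. */
static const struct toy_proto_ops *toy_lookup(int family)
{
    if (family < 0 || family >= TOY_NPROTO)
        return NULL;
    return toy_pops[family];
}

/* Dummy operations for an INET-like family (illustration only). */
static int toy_inet_create(int type, int protocol) { (void)protocol; return type; }
static int toy_inet_sendmsg(const void *buf, size_t len) { (void)buf; return (int)len; }

static const struct toy_proto_ops toy_inet_ops = {
    2, "INET", toy_inet_create, toy_inet_sendmsg
};
```

A caller would register `toy_inet_ops` once at init time and then dispatch through `toy_lookup(family)->create(...)`, which mirrors the "search pops, then call the family-specific create routine" order on the slide.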
Binding an Address to INET Socket
  - int bind(sockfd, sockaddr, socklen)
  - Mostly handled by INET layer
      - socket must not already be in use: sock->state == TCP_CLOSE
  - sockaddr contains IP address, port number
      - sock->sk->saddr = <IP address>
  - Routing of received packets
      - Hash tables in TCP/UDP for address lookups
      - Direct to correct socket/sock pair
Making a Connection on INET Socket
  - int connect(sockfd, sockaddr, len)
  - socket->state should be SS_UNCONNECTED
  - UDP connect
      - setup addresses of remote application
      - cache route in ip_route_cache
  - TCP connect
      - build TCP connect message
      - add sock to tcp_listening_hash
Listening on INET Socket
  - int listen(sockfd, backlog)
  - Set sock->state = TCP_LISTEN
  - Add sock to tcp_bound_hash
  - Build a new sock for TCP
      - bottom-half of TCP
      - Clone incoming sk_buff and queue it on receive_queue
Accepting Connection Requests
  - int accept(sockfd, sockaddr, len)
  - TCP only
  - Clone listening socket
  - Add process to a wait queue and schedule
  - On connect request
      - return sock to INET socket layer
      - Link sock to socket
      - return socket fd to application
Linux IP Layer
  - Interface with network devices
  - Socket buffers
  - Receiving IP packets
  - Sending IP packets
  - Data fragmentation
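Network Device Structure
  [Figure: struct device between the protocol layers and the hardware]
  - Upper interface
      - dev_queue_xmit: deliver packets
      - netif_rx: receive & queue packets
  - struct device
      - Methods and variables
      - Initialization routine
  - Lower interface
      - hard_start_xmit: deliver frames
      - dev_interrupt: collect rx frames
  - Physical Device and Media
Socket Buffers
  [Figure: sk_buff and the packet buffer it describes]
  - sk_buff fields: next, prev, dev, head, data, tail, end, len, truesize
  - head, data, tail and end point into the buffer holding the packet to be transmitted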
sk_buff operations
  - skb_push: add data or headers to the start of data to be transmitted
  - skb_pull: remove data or headers from the start of received data
  - skb_put: add data to the end of packet to be transmitted
  - skb_trim: remove data or trailer from the received packet
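The four operations above only move the data and tail pointers inside a fixed buffer. The following user-space model illustrates that pointer arithmetic; it is a sketch, not the kernel implementation, and `struct toy_skb` and every `toy_*` function are hypothetical names. Unlike the kernel routines, which panic on overrun, these return an error value.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of the sk_buff pointers: head <= data <= tail <= end. */
struct toy_skb {
    unsigned char *head;   /* start of allocated buffer */
    unsigned char *data;   /* start of valid data */
    unsigned char *tail;   /* end of valid data */
    unsigned char *end;    /* end of allocated buffer */
    size_t len;            /* tail - data */
};

static struct toy_skb *toy_alloc(size_t size)
{
    struct toy_skb *skb = malloc(sizeof(*skb));
    if (!skb)
        return NULL;
    skb->head = malloc(size ? size : 1);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

static void toy_free(struct toy_skb *skb)
{
    free(skb->head);
    free(skb);
}

/* Leave n bytes of headroom for headers; only valid on an empty buffer. */
static int toy_reserve(struct toy_skb *skb, size_t n)
{
    if (skb->len != 0 || (size_t)(skb->end - skb->tail) < n)
        return -1;
    skb->data += n;
    skb->tail += n;
    return 0;
}

/* Like skb_put: grow at the tail, return the start of the new area. */
static unsigned char *toy_put(struct toy_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;
    if ((size_t)(skb->end - skb->tail) < n)
        return NULL;
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Like skb_push: grow at the front, into previously reserved headroom. */
static unsigned char *toy_push(struct toy_skb *skb, size_t n)
{
    if ((size_t)(skb->data - skb->head) < n)
        return NULL;
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Like skb_pull: drop n bytes (e.g. a parsed header) from the front. */
static unsigned char *toy_pull(struct toy_skb *skb, size_t n)
{
    if (n > skb->len)
        return NULL;
    skb->data += n;
    skb->len -= n;
    return skb->data;
}

/* Like skb_trim: cut the data down to new_len bytes, dropping the tail. */
static void toy_trim(struct toy_skb *skb, size_t new_len)
{
    if (new_len < skb->len) {
        skb->len = new_len;
        skb->tail = skb->data + new_len;
    }
}
```

A transmit path would typically call `toy_reserve` for the headers, `toy_put` for the payload, and then `toy_push` once per protocol layer on the way down; a receive path does the reverse with `toy_pull` and `toy_trim`.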
Receiving IP Packets
  - Network device converts received data into sk_buffs
  - sk_buffs are added to backlog queue
  - Bottom-half handler is flagged to run
  - Backlog queue is processed
  - IP fragments (ipq) are put in ipqueue list
Sending IP Packets
  - Determine packet route
      - for IP, use rtable
  - Build sk_buff to contain data and protocol headers
      - source IP address
      - address of network device
      - prebuilt hardware header (cached)
Socket Buffer Management
    void append_frame(char *buf, int len)
    {
        struct sk_buff *skb = alloc_skb(len, GFP_ATOMIC);

        if (skb == NULL)
            dropped++;
        else {
            skb_put(skb, len);
            memcpy(skb->data, buf, len);
            skb_append(&list, skb);
        }
    }

    void process_queue(void)
    {
        struct sk_buff *skb;

        while ((skb = skb_dequeue(&list)) != NULL) {
            process_data(skb);
            kfree_skb(skb, FREE_READ);
        }
    }
Higher Level Support Routines
  Receive Data
    sk = find_socket(something);
    if (sock_queue_rcv_skb(sk, skb) == -1) {
        dropped++;
        kfree_skb(skb, FREE_READ);
        return;
    }
  Transmit Data
    skb = sock_alloc_send_skb(sk, ...);
    if (skb == NULL)
        return -err;
    skb->sk = sk;
    skb_reserve(skb, headroom);
    skb_put(skb, len);
    memcpy(skb->data, data, len);
    protocol_do_something(skb);
Data Fragmentation
  - Packet size for network device smaller than transmit data
      - fragment fields in protocol header
  - MTU for device determined from routing tables
  - Each fragment represented by sk_buff
  - Received fragments (ipq) stored in ipqueue list
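To make the fragment fields concrete, the sketch below plans how a payload would be split for a given MTU. It relies on two IPv4 facts: the fragment offset is expressed in 8-byte units, and every fragment except the last must carry a multiple of 8 payload bytes. The names `toy_frag` and `toy_fragment` are hypothetical, and the code only computes the plan; it does not build packets or sk_buffs.

```c
#include <stddef.h>

/* Hypothetical description of one planned fragment. */
struct toy_frag {
    unsigned offset_units;   /* fragment offset in 8-byte units */
    unsigned length;         /* payload bytes carried by this fragment */
    int more_fragments;      /* 1 if the MF flag would be set */
};

/*
 * Plan the fragments for payload_len bytes sent over a link with the given
 * MTU and IP header length. Returns the number of fragments written to out,
 * or -1 if the MTU is too small or out has fewer than the needed slots.
 */
static int toy_fragment(unsigned payload_len, unsigned mtu, unsigned hdr_len,
                        struct toy_frag *out, size_t max_out)
{
    unsigned per_frag, off = 0;
    size_t n = 0;

    if (mtu <= hdr_len)
        return -1;
    per_frag = (mtu - hdr_len) & ~7u;   /* round down to a multiple of 8 */
    if (per_frag == 0)
        return -1;

    do {
        unsigned remaining = payload_len - off;
        unsigned chunk = remaining > per_frag ? per_frag : remaining;

        if (n == max_out)
            return -1;
        out[n].offset_units = off / 8;
        out[n].length = chunk;
        out[n].more_fragments = (off + chunk < payload_len);
        off += chunk;
        n++;
    } while (off < payload_len);

    return (int)n;
}
```

For example, a 4000-byte payload over a 1500-byte MTU with a 20-byte header yields three fragments of 1480, 1480 and 1040 bytes at offsets 0, 185 and 370 (in 8-byte units), with MF set on the first two.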
Address Resolution Protocol (ARP)
  - Translate IP address to physical hardware address
      - Header rebuilding routine for translation
  - ARP request
      - IP address broadcast
  - ARP response
      - Hardware address from owner
  - arp_table
      - last used, updated
      - IP address, h/w address & header
      - timer, retries, sk_buff queue
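Socket System Calls (TCP)
  [Figure: call sequence between server and client]
  - Server: socket(), bind(), listen(), accept()
      - accept() blocks until connection from client
  - Client: socket(), connect()
      - connection establishment
  - Client write() -> data (request) -> server read()
  - Server: process request
  - Server write() -> data (reply) -> client read()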
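The call sequence in this figure can be exercised from user space with both ends in one process. The sketch below is only an illustration under stated assumptions: `toy_loopback_exchange` is a hypothetical helper, the "server" simply echoes the request, the port is chosen by the kernel, and the messages are assumed short enough to arrive in a single read() on loopback.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Run the server and client sides of the figure over the loopback interface.
 * Returns the number of reply bytes stored in reply, or -1 on any error.
 */
static int toy_loopback_exchange(const char *request, char *reply, size_t reply_size)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    char buf[256];
    ssize_t n;
    size_t req_len = strlen(request);
    int lfd = -1, cfd = -1, sfd = -1, rc = -1;

    if (req_len == 0 || req_len > sizeof(buf) || reply_size <= req_len)
        return -1;

    /* Server: socket(), bind(), listen() */
    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        goto out;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the kernel pick a free port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (listen(lfd, 1) < 0)
        goto out;
    if (getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0)
        goto out;

    /* Client: socket(), connect(). The handshake completes in the kernel,
     * so the connection waits in the listen backlog until accept(). */
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0)
        goto out;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* Server: accept() returns a new connected socket */
    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0)
        goto out;

    /* Client write() (request), server read() */
    if (write(cfd, request, req_len) != (ssize_t)req_len)
        goto out;
    n = read(sfd, buf, sizeof(buf));
    if (n <= 0)
        goto out;

    /* Server "processes" the request by echoing it: write() (reply), client read() */
    if (write(sfd, buf, (size_t)n) != n)
        goto out;
    n = read(cfd, reply, reply_size - 1);
    if (n < 0)
        goto out;
    reply[n] = '\0';
    rc = (int)n;
out:
    if (sfd >= 0)
        close(sfd);
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return rc;
}
```

A real server would run accept() in its own process or thread and loop on read() until the whole request has arrived; the single-process form here just keeps the ordering of the calls visible.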
Invoking Socket Calls in Kernel
  [Figure: the TCP call sequence between server and client, with the in-kernel equivalent of each server call]
  - Server
      - socket(): err = sock_create(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
      - bind():   err = sock->ops->bind(sock, &sin, sizeof(sin));
      - listen(): err = sock->ops->listen(sock, 48);
      - accept(): err = sock->ops->accept(sock, newsock, O_NONBLOCK);
          - blocks until connection from client
      - read():   rc = sock_recvmsg(sock, &msg, len, flags);
      - process request
      - write():  rc = sock_sendmsg(sock, &msg, len);
  - Client
      - socket(), connect() (connection establishment)
      - write() -> data (request)
      - read() <- data (reply)
Performance bottlenecks
  - Per-packet and per-byte costs
  - Data touching overheads
      - Copying data between system and application buffers
      - TCP Checksumming
          - data integrity
          - per byte or packet
  - Zerocopy approach
Zerocopy I/O
  - Memory mapped files
      - access to static mappable objects
  - Raw disk I/O
      - synchronous
  - Raw writes
      - user buffer accessed directly by disk driver
      - request blocks until end of data transfer
  - Raw reads
      - read buffer posted before disk I/O
Issues with Zerocopy in TCP
  - Transmit side
      - Retain user data for possible retransmission
      - copy user data into a kernel buffer and put in outbound queue
      - return asynchronously to user
      - high throughput, buffer reuse
Issues with Zerocopy in TCP
  - Receive side
      - Packets arrive at network interface asynchronously
      - user read buffers not usually posted
      - limited interface memory
      - copy incoming data into a kernel buffer and put in inbound queue
Zerocopy Schemes
  - User accessible interface memory
      - pre-mapping into user and kernel address spaces
      - no copies
  - complicated hardware support
      - cache flushing
      - intelligence in adapters to direct data
  - substantial software changes
      - special buffer management calls
  - Limited interface memory
      - memory leaks
Zerocopy Schemes
  - Kernel-network shared memory
      - DMA or programmed I/O to move data between interface memory and user buffers
      - No changes in existing applications
      - Co-management of buffer pool between kernel and interface hardware
      - Pinning of user pages for DMA
      - Retransmit buffers in buffer pool
Zerocopy Schemes
  - User-kernel shared memory
      - APIs with shared semantics between user and kernel address spaces
      - DMA between shared memory and network interface
  - Fast buffers (fbufs)
      - per process buffer pool pre-mapped in user and kernel
  - Application compatibility problems
  - Buffer pool fragmentation
  - Targeted DMA transfer to correct memory pool
Zerocopy Schemes
  - User-kernel page remapping + COW
      - DMA transfer between interface memory and kernel buffers
      - Data "transfer" through page remapping
          - edit MMU tables
      - Copy-on-write (COW) on transmit side
  - Expensive VM operations
  - Operations on page boundaries
Hardware Checksumming
  - Calculate data checksums during DMA transfers
  - Software checksums can be expensive with cold caches
  - Modern interface adapters (Gbit) perform checksumming in hardware
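For reference, this is the per-byte work that a software checksum performs and that hardware checksumming takes off the CPU. The sketch computes the 16-bit one's-complement Internet checksum used by IP, TCP and UDP, in the byte-at-a-time form described in RFC 1071; `toy_inet_checksum` is a hypothetical name, and the kernel's own routines are architecture-specific and considerably more optimized.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * 16-bit one's-complement Internet checksum over len bytes.
 * Bytes are combined as big-endian 16-bit words; an odd trailing byte is
 * padded with a zero byte. Every byte of the data is touched once.
 */
static uint16_t toy_inet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;   /* wide accumulator so carries are not lost */

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A receiver can verify a packet by running the same function over the data with the transmitted checksum included; a result of 0 means the sums agree.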
Zerocopy in Linux
  - Page, offset, length tuples
      - Scatter-gather lists
  - writepage, sendfile
  - TCP socket options
      - MSG_MORE
      - TCP_CORK
Using sendfile()
  Original code
    while ((c = read(filefd, buf, sizeof(buf))) > 0) {
        if ((d = write(sockfd, buf, c)) < 0)
            break;
        bytes += c;
    }
  Modified code
    if (fstat(filefd, &statbuf) < 0)
        break;
    fsize = statbuf.st_size;
    bytes = sendfile(sockfd, filefd, &offset, fsize);
Using zerocopy TCP sendfile()
  Original code
    rc = read(filefd, packet->data, size);
    packet->hdr = build_header(rc);
    rc = send(sockfd, packet, packet_size, 0);
  Modified code
    /* Assume that file is locked and size won't change
       in the process of doing sendfile */
    fstat(filefd, &statbuf);
    packet->hdr = build_header(statbuf.st_size);
    rc = send(sockfd, packet->hdr, hdrsize, MSG_MORE);
    rc = sendfile(sockfd, filefd, &offset, statbuf.st_size);