The Linux Network Subsystem
Covers Linux version 2.6.25
Version 1.1
Rights to copy
Attribution-ShareAlike 2.0
You are free:
- to copy, distribute, display, and perform the work
- to make derivative works
- to make commercial use of the work
Under the following conditions:
- Attribution. You must give the original author credit.
- Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.
For any reuse or distribution, you must make clear to others the license terms of this work. Any of these conditions can be waived if you get permission from the copyright holder. Your fair use and other rights are in no way affected by the above.
License text: http://creativecommons.org/licenses/by-sa/2.0/legalcode
This kit contains work by the following authors:
- Copyright 2004-2006 Michael Opdenacker, michael@free-electrons.com, http://www.free-electrons.com
- Copyright 2003-2006 Oron Peled, oron@actcom.co.il, http://www.actcom.co.il/~oron
- Copyright 2004-2008 Codefidence Ltd., info@codefidence.com, http://www.codefidence.com
What is Linux?
- Linux is a kernel that implements the POSIX and Single Unix Specification standards and is developed as an Open Source project.
- When one talks of installing Linux, one is referring to a Linux distribution: a combination of Linux and other programs and libraries that together form an operating system.
- Linux runs on 24 main platforms and supports applications ranging from ccNUMA super clusters to cellular phones and micro controllers.
- Linux is 15 years old, but is based on the 40-year-old Unix design philosophy.
Layers in a Linux system
[Figure: the layers of a Linux system. Labels: user programs, application libraries, system libraries, C library, kernel, kernel modules.]
Kernel architecture
[Figure: kernel architecture, top to bottom.
User space: App1, App2, ..., C library.
Kernel space: system call interface; process management, memory management, filesystem support, device control, networking; CPU support code, CPU / MMU support code, filesystem types, storage drivers, character device drivers, network device drivers.
Hardware: CPU, RAM, storage.]
Kernel Mode vs. User Mode
- All modern CPUs support a dual mode of operation:
  - User mode, for regular tasks.
  - Supervisor (or privileged) mode, for the kernel.
- The mode the CPU is in determines which instructions the CPU is willing to execute: sensitive instructions will not be executed when the CPU is in user mode.
- The CPU mode is determined by one of the CPU registers, which stores the current ring level. On x86 this is 0 for supervisor mode and 3 for user mode; levels 1 and 2 are unused by Linux.
The System Call Interface
- When a user space task needs to use a kernel service, it makes a system call.
- The C library places the system call number and its parameters in registers and then issues a special trap instruction.
- The trap atomically changes the ring level to supervisor mode and sets the instruction pointer to a kernel entry point.
- The kernel finds the required system call via the system call table and executes it.
- Returning from the system call does not require a special instruction, since in supervisor mode the ring level can be changed directly.
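The two ways of reaching the same kernel service can be compared from user space. This is a minimal sketch, assuming Linux with glibc; the helper names raw_getpid and syscall_paths_agree are hypothetical and only serve the illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: issue the getpid system call by number.
 * syscall() places the number and arguments in registers and
 * executes the trap instruction for us. */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* The dedicated C library wrapper and the raw call reach the same
 * kernel service, so both must report the same process ID. */
static int syscall_paths_agree(void)
{
    return raw_getpid() == getpid();
}
```

Calling syscall_paths_agree() from any program should return 1, since both calls resolve to the same entry in the system call table.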
Linux System Call Path
[Figure: a task calls a glibc function (function call); glibc traps into the kernel; entry.S dispatches to sys_name(), which calls do_name().]
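Linux Networking Subsystem Overview
[Figure: applications (App1, App2, App3) sit on top of the socket layer (the Stack <> App interface). Below it, the networking stack contains UDP, TCP, ICMP, IP and the bridge. The driver interface (Driver <> Stack) connects the stack to the driver, which talks to the hardware (Driver <> Hardware).]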
Network Device Driver Hardware Interface
[Figure: the driver and the device share Tx and Rx descriptor rings in memory. Each descriptor points to a packet buffer and carries a state (Tx: Send, Free, SentOK, SendErr; Rx: Free, RcvOK, RcvErr, RecvCRC). The device accesses memory by DMA; the driver accesses the device through memory mapped registers and is notified by interrupts.]
- The driver allocates the ring buffers.
- The driver resets the descriptors to their initial state.
- The driver puts packets to be sent in Tx buffers.
- The device puts received packets in Rx buffers.
- The driver and the device update the descriptors to indicate state.
- The device signals Rx and end of Tx with an interrupt, unless interrupt mitigation techniques are applied.
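The descriptor hand-off on the Tx side can be sketched as a small user-space model. Everything here is an assumption made for illustration: the ring size, the struct layout, the state names and the function names are hypothetical and do not come from any real driver, and device_send_one() merely stands in for the hardware.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4   /* assumed size, kept tiny for the example */

enum desc_state { DESC_FREE, DESC_SEND, DESC_SENT_OK };

struct tx_desc {
    const void *buf;         /* packet data the device would DMA from */
    size_t len;
    enum desc_state state;   /* updated by both driver and device */
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    unsigned int next_to_use;    /* next slot the driver fills */
    unsigned int next_to_clean;  /* next slot the driver reclaims */
};

/* Driver resets all descriptors to their initial state. */
static void ring_reset(struct tx_ring *r)
{
    for (unsigned int i = 0; i < RING_SIZE; i++)
        r->desc[i] = (struct tx_desc){ NULL, 0, DESC_FREE };
    r->next_to_use = r->next_to_clean = 0;
}

/* Driver puts a packet in the next Tx slot; -1 means the ring is full. */
static int ring_queue_tx(struct tx_ring *r, const void *buf, size_t len)
{
    struct tx_desc *d = &r->desc[r->next_to_use];

    if (d->state != DESC_FREE)
        return -1;
    d->buf = buf;
    d->len = len;
    d->state = DESC_SEND;    /* hand the descriptor to the device */
    r->next_to_use = (r->next_to_use + 1) % RING_SIZE;
    return 0;
}

/* Stand-in for the hardware: complete the oldest pending descriptor. */
static int device_send_one(struct tx_ring *r)
{
    for (unsigned int n = 0; n < RING_SIZE; n++) {
        struct tx_desc *d = &r->desc[(r->next_to_clean + n) % RING_SIZE];

        if (d->state == DESC_SEND) {
            d->state = DESC_SENT_OK;
            return 1;
        }
    }
    return 0;
}

/* Driver's end-of-Tx handling: reclaim completed descriptors in order. */
static unsigned int ring_reclaim_tx(struct tx_ring *r)
{
    unsigned int freed = 0;

    while (freed < RING_SIZE &&
           r->desc[r->next_to_clean].state == DESC_SENT_OK) {
        r->desc[r->next_to_clean].state = DESC_FREE;
        r->next_to_clean = (r->next_to_clean + 1) % RING_SIZE;
        freed++;
    }
    return freed;
}
```

In a real driver the reclaim step would run from the Tx completion interrupt, and a full ring is the point at which the driver would signal back pressure to the stack.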
Network Device Registration
- Each network device is represented by a struct net_device.
- These are allocated using:
    struct net_device *alloc_netdev(size, mask, setup_func);
  - size: size of our private data part
  - mask: a naming pattern (e.g. eth%d)
  - setup_func: a function that sets up the rest of the net_device
- The device is then registered via a call to:
    int register_netdev(struct net_device *dev);
Network Device Initialization
The net_device structure is initialized with numerous methods and flags by the setup function.
Methods:
- open: requests resources, registers interrupts, starts queues.
- stop: deallocates resources, unregisters the IRQ, stops the queue.
- get_stats: reports statistics.
- set_multicast_list: configures the device for multicast.
- hard_start_xmit: called by the stack to initiate Tx.
Flags:
- IFF_MULTICAST: the device supports multicast.
- IFF_NOARP: the device does not support the ARP protocol.
Packet Representation
- We need to manipulate packets as they pass through the stack.
- This manipulation must be efficient at:
  - Adding protocol headers/trailers on the way down the stack.
  - Removing protocol headers/trailers on the way up the stack.
- Packets can be chained together.
- Each protocol should have convenient access to its header fields.
- To do all this the kernel uses the sk_buff structure.
Socket Buffers
- The sk_buff structure represents a single packet.
- This structure is passed through the protocol stack.
- It holds pointers to a buffer with the packet data.
- It holds many other types of information:
  - Data size.
  - Incoming device.
  - Priority.
  - Security...
struct sk_buff
- next: Next buffer in list
- prev: Previous buffer in list
- sk: Socket we are owned by
- tstamp: Time we arrived
- dev: Device we arrived on/are leaving by
- input_dev: Device we arrived on
- h: Transport layer header
- nh: Network layer header
- mac: Link layer header
- dst: Destination route cache entry
- sp: Security path, used for xfrm
- cb: Control buffer. Private data.
- len: Length of actual data
- data_len: Data length
- mac_len: Length of link layer header
- csum: Checksum
- local_df: Allow local fragmentation flag
- cloned: Head may be cloned (see refcnt)
- nohdr: Payload reference only flag
- pkt_type: Packet class
- fclone: Clone status
- ip_summed: Driver fed us an IP checksum
- priority: Packet queuing priority
- users: User count, see {datagram,tcp}.c
- protocol: Packet protocol from driver
- truesize: Buffer size
- head: Head of buffer
- data: Data head pointer
- tail: Tail pointer
- end: End pointer
- destructor: Destruct function
- nfmark: Netfilter hooks private data
- nfct: Associated connection, if any
- ipvs_property: skbuff is owned by ipvs
- nfctinfo: Connection tracking info
- nfct_reasm: Netfilter conntrack re-assembly pointer
- nf_bridge: Saved data about a bridged frame
- tc_index: Traffic control index
- tc_verd: Traffic control verdict
- dma_cookie: DMA operation cookie
- secmark: Security marking for LSM
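Socket Buffer Diagram
[Figure: struct sk_buff fields (len, head, data, tail, end, dev) pointing into a data buffer laid out as headroom, Ethernet header, IP header, TCP header, payload and padding, followed by struct skb_shared_info, which references the fragments frag1, frag2 and frag3.]
Note: the network chip must support Scatter/Gather to make use of frags. Otherwise the kernel must copy the buffers before sending!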
Socket Buffer Operations
- skb_put: add data to the end of a buffer.
- skb_push: add data to the start of a buffer.
- skb_pull: remove data from the start of a buffer.
- skb_headroom: returns the free bytes at the buffer head.
- skb_tailroom: returns the free bytes at the buffer end.
- skb_reserve: adjust headroom.
- skb_trim: remove the end from a buffer.
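The effect of these operations on the head, data, tail and end pointers can be sketched with a small user-space model. This is not kernel code: struct buf_model and the model_* functions are hypothetical stand-ins that only mimic the pointer arithmetic, and assert() stands in for the kernel's panic on overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the four sk_buff data pointers. */
struct buf_model {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of the packet data */
    unsigned char *tail;  /* end of the packet data */
    unsigned char *end;   /* end of the allocated buffer */
};

static void model_init(struct buf_model *b, unsigned char *mem, size_t size)
{
    b->head = b->data = b->tail = mem;
    b->end = mem + size;
}

static size_t model_len(const struct buf_model *b)      { return (size_t)(b->tail - b->data); }
static size_t model_headroom(const struct buf_model *b) { return (size_t)(b->data - b->head); }
static size_t model_tailroom(const struct buf_model *b) { return (size_t)(b->end - b->tail); }

/* Like skb_reserve: move the empty data area forward to create headroom. */
static void model_reserve(struct buf_model *b, size_t len)
{
    assert(model_len(b) == 0 && len <= model_tailroom(b));
    b->data += len;
    b->tail += len;
}

/* Like skb_put: extend the data area at the tail, return the new region. */
static unsigned char *model_put(struct buf_model *b, size_t len)
{
    unsigned char *old_tail = b->tail;

    assert(len <= model_tailroom(b));
    b->tail += len;
    return old_tail;
}

/* Like skb_push: extend the data area at the front, using headroom. */
static unsigned char *model_push(struct buf_model *b, size_t len)
{
    assert(len <= model_headroom(b));
    b->data -= len;
    return b->data;
}

/* Like skb_pull: drop len bytes from the front; NULL if too short. */
static unsigned char *model_pull(struct buf_model *b, size_t len)
{
    if (len > model_len(b))
        return NULL;
    b->data += len;
    return b->data;
}

/* Like skb_trim: cut the data area down to len bytes. */
static void model_trim(struct buf_model *b, size_t len)
{
    if (len < model_len(b))
        b->tail = b->data + len;
}
```

A typical transmit sequence in this model is reserve (make room for headers), put (append payload), then one push per protocol header on the way down the stack; receive uses pull to strip each header on the way up.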
Operation Example: skb_put
    unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
Adds data to a buffer:
- skb: buffer to use
- len: amount of data to add
This function extends the used data area of the buffer. If this would exceed the total buffer size, the kernel will panic. A pointer to the first byte of the extra data is returned.
Socket Buffer Alignment
- CPUs often take a performance hit when accessing unaligned memory locations.
- Since an Ethernet header is 14 bytes, network drivers often end up with the IP header at an unaligned offset.
- The IP header can be aligned by shifting the start of the packet by 2 bytes. Drivers should do this with:
    skb_reserve(skb, NET_IP_ALIGN);
- The downside is that the DMA is now unaligned. On some architectures the cost of an unaligned DMA outweighs the gains, so NET_IP_ALIGN is set on a per arch basis.
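The arithmetic behind the 2-byte shift can be checked with a few lines of C. This sketch assumes the buffer itself starts on an aligned address and treats a 4-byte boundary as "aligned"; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_HEADER_LEN 14   /* Ethernet header size, as stated above */
#define IP_ALIGN_SHIFT 2    /* the 2-byte shift applied with skb_reserve */

/* Offset of the IP header from the start of the buffer, given how many
 * bytes were reserved in front of the Ethernet header. */
static size_t ip_header_offset(size_t reserved)
{
    return reserved + ETH_HEADER_LEN;
}

/* True if the offset falls on a 4-byte boundary. */
static int is_word_aligned(size_t offset)
{
    return offset % 4 == 0;
}
```

Without the shift the IP header starts at offset 14, which is not a multiple of 4; with the shift it starts at offset 16. The Ethernet header, and therefore the start of the DMA, moves from offset 0 to offset 2, which is the unaligned DMA the slide warns about.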
Socket Buffer Padding
- The networking layer reserves some headroom in skb data.
- This is used to avoid having to reallocate skb data when the header has to grow. In the default case, if the header has to grow 16 bytes or less, we avoid the reallocation.
- Unfortunately, this headroom changes the DMA alignment of the resulting network packet. As with NET_IP_ALIGN, this unaligned DMA is expensive on some architectures.
- Therefore an architecture can override this value (NET_SKB_PAD), as long as at least 16 bytes of free headroom remain.
Socket Buffer Allocations
- dev_alloc_skb: allocate an skbuff for Rx.
- netdev_alloc_skb: allocate an skbuff for Rx, on a specific device.
- Both allocate a new sk_buff and assign it a usage count of one.
- The buffer has unspecified headroom built in. Users should allocate the headroom they think they need without accounting for the built in space. The built in space is used for optimizations.
- NULL is returned if there is no free memory.
- Although these functions allocate memory, they can be called from interrupt context.
sk_buff Allocation Example
Immediately after allocation, we should reserve the needed headroom:
    struct sk_buff *skb;

    skb = dev_alloc_skb(1500);
    if (unlikely(!skb))
        break;
    /* Mark as being used by this device */
    skb->dev = dev;
    /* Align IP on 16 byte boundaries */
    skb_reserve(skb, NET_IP_ALIGN);
Softnet
- The network stack is implemented as a pair of softirqs, to parallelize packet handling on SMP machines:
  - NET_TX_SOFTIRQ: feeds packets from the network stack to the driver.
  - NET_RX_SOFTIRQ: feeds packets from the driver to the network stack.
- Like any other softirq, these are called on return from interrupt or via the low priority ksoftirqd kernel thread.
- Transmit/receive queues are stored in the per cpu softnet_data.
Linux Contexts
[Figure: Linux execution contexts. Interrupt context: interrupt handlers and SoftIRQs (hi prio tasklets, net stack, timers, ..., regular tasklets). User context: user space processes and threads, and kernel threads. Other labels: kernel space, network interface device driver.]
Packet Reception
- The driver allocates an skb and sets up a descriptor in the ring buffers for the hardware.
- The driver Rx interrupt handler calls netif_rx(skb).
- netif_rx deposits the sk_buff in the per cpu input queue and marks NET_RX_SOFTIRQ to run.
- At SoftIRQ processing time, net_rx_action() is called by NET_RX_SOFTIRQ, which calls the driver poll() method to feed the packet up.
- Normally poll() is set to process_backlog() by net_dev_init().
Packet Rx Overview
[Figure: packet Rx overview diagram.]
Packet Transmission
- Each network device defines a method:
    int (*hard_start_xmit)(struct sk_buff *skb, struct net_device *dev);
- This function is indirectly called from NET_TX_SOFTIRQ.
- Calls are serialized via the lock dev->xmit_lock_owner.
- The driver manages the transmit queue when the interface goes up or down, or to signal back pressure, using the following functions:
    void netif_start_queue(struct net_device *net);
    void netif_stop_queue(struct net_device *net);
    void netif_wake_queue(struct net_device *net);
Packet Tx Overview
[Figure: packet Tx overview diagram.]
NAPI
- NAPI is the network "New API".
- It provides interrupt mitigation.
- Requirements:
  - A DMA ring buffer.
  - The ability to turn off receive interrupts or events.
- It is used by defining a new method:
    int (*poll)(struct net_device *dev, int *budget);
  which is called by the network stack periodically when signaled by the driver to do so.
NAPI (cont.)
- When a receive interrupt occurs, the driver:
  - Turns off receive interrupts.
  - Calls netif_rx_schedule(dev) to get the stack to start calling its poll method.
- The poll method:
  - Scans the receive ring buffers, feeding packets to the stack via netif_receive_skb(skb).
  - If the work finished within the budget parameter, re-enables interrupts and calls netif_rx_complete(dev).
  - Otherwise, the stack will call the poll method again.
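The interrupt-then-poll cycle can be sketched as a user-space model. This is not driver code: struct nic_model and the model_* functions are hypothetical, and the comments mark where a real driver would call netif_rx_schedule(), netif_receive_skb() and netif_rx_complete(). The return convention (0 when done, 1 when work remains) is an assumption of the model.

```c
#include <assert.h>

struct nic_model {
    int rx_pending;      /* packets waiting in the Rx ring */
    int rx_irq_enabled;  /* 1 while receive interrupts are on */
    int scheduled;       /* 1 while the stack should keep polling */
    int delivered;       /* packets handed to the stack so far */
};

/* Receive interrupt: mask further Rx interrupts and ask to be polled. */
static void model_rx_interrupt(struct nic_model *n)
{
    n->rx_irq_enabled = 0;
    n->scheduled = 1;            /* stands in for netif_rx_schedule(dev) */
}

/* Poll method: returns 0 when the ring is drained, 1 if work remains. */
static int model_poll(struct nic_model *n, int *budget)
{
    int done = 0;

    while (n->rx_pending > 0 && done < *budget) {
        n->rx_pending--;
        n->delivered++;          /* stands in for netif_receive_skb(skb) */
        done++;
    }
    *budget -= done;

    if (n->rx_pending == 0) {
        n->scheduled = 0;        /* stands in for netif_rx_complete(dev) */
        n->rx_irq_enabled = 1;   /* work finished: interrupts back on */
        return 0;
    }
    return 1;                    /* budget exhausted: poll again later */
}
```

The point of the design is visible in the model: under load, one interrupt is followed by many packets processed with interrupts off, and interrupts only come back once the ring is empty.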
Routing
- After the socket buffer is delivered to a protocol handler, the handler may decide to route the packet.
- The default routing uses normal destination based routing with a single table and a FIB destination cache.
- For each packet the routing destination is looked up in the FIB cache. If found, the packet is sent to that interface driver.
- Otherwise a more costly routing decision based on rules occurs and the result is stored in the FIB.
What is Netfilter?
- Netfilter is a framework for packet mangling.
- Each protocol defines "hooks" (IPv4 defines 5), which are well defined points in a packet's traversal of that protocol stack.
- At each of these points, the protocol will call the netfilter framework with the packet and the hook number.
- Parts of the kernel can register to listen to the different hooks for each protocol.
- When a packet is passed to the netfilter framework, it will call all registered callbacks for that hook and protocol.
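Netfilter Architecture
[Figure: the netfilter hooks on the packet path. Ingress, Pre Routing, Route, then either Forward or Local In to the local sockets; Local Out from the local sockets, Route, Post Routing, Egress.]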
Netfilter Hook
- Kernel code can register a callback function to be called when a packet arrives at each hook. Callbacks are free to manipulate the packet.
- The callback can then tell netfilter to do one of five things:
  - NF_ACCEPT: continue traversal as normal.
  - NF_DROP: drop the packet; don't continue traversal.
  - NF_STOLEN: I've taken over the packet; stop traversal.
  - NF_QUEUE: queue the packet (usually for userspace handling).
  - NF_REPEAT: call this hook again.
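How the five verdicts steer traversal can be sketched with a user-space model of a callback chain. This is not the netfilter API: the verdict enum, struct packet_model, run_hooks() and the two sample callbacks are all hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Model verdicts mirroring the five outcomes listed above. */
enum verdict { V_DROP, V_ACCEPT, V_STOLEN, V_QUEUE, V_REPEAT };

struct packet_model {
    int ttl;    /* assumed field, used by the drop example */
    int marks;  /* counts how often the marking hook ran */
};

typedef enum verdict (*hook_fn)(struct packet_model *pkt);

/* Call the registered callbacks in order. ACCEPT moves to the next
 * callback, REPEAT calls the same one again, and any other verdict
 * ends traversal and is returned to the caller. */
static enum verdict run_hooks(hook_fn *hooks, size_t n,
                              struct packet_model *pkt)
{
    size_t i = 0;

    while (i < n) {
        enum verdict v = hooks[i](pkt);

        if (v == V_REPEAT)
            continue;
        if (v != V_ACCEPT)
            return v;
        i++;
    }
    return V_ACCEPT;
}

/* Sample callback: asks to be called again until it has run twice. */
static enum verdict mark_twice(struct packet_model *pkt)
{
    return ++pkt->marks < 2 ? V_REPEAT : V_ACCEPT;
}

/* Sample callback: drops packets whose TTL has run out. */
static enum verdict drop_expired(struct packet_model *pkt)
{
    return pkt->ttl <= 0 ? V_DROP : V_ACCEPT;
}
```

With the chain { mark_twice, drop_expired }, a packet with a positive TTL is accepted after being marked twice, while a packet with TTL 0 is dropped by the second callback.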
IP Tables
- A packet selection system called IP Tables has been built over the netfilter framework.
- It is a direct descendant of ipchains (that came from ipfwadm, that came from BSD's ipfw IIRC), with extensibility.
- Kernel modules can register a new table, and ask for a packet to traverse a given table.
- This packet selection method is used for packet filtering (the 'filter' table), Network Address Translation (the 'nat' table) and general pre-route packet mangling (the 'mangle' table).
IP Tables and Netfilter Hooks
[Figure: the netfilter hook diagram (Ingress, Pre Routing, Route, Forward, Post Routing, Egress, Local In, Local Out, Local Sockets) annotated with the components attached at each hook: Conntrack, Mangle, Destination NAT, Source NAT and Filter.]
BSD Sockets Interface
User space network interface:
- socket() / bind() / accept() / listen(): initialization, addressing and hand shaking
- select() / poll() / epoll(): waiting for events
- send() / recv(): stream oriented (e.g. TCP) Rx / Tx
- sendto() / recvfrom(): datagram oriented (e.g. UDP) Rx / Tx
Simple Client/Server
Client (pseudocode):
    socket s;
    char buf[256];
    s = socket()
    connect(s, IP:port)
    while (ret != 0)
        ret = recv(s, buf)
Server (pseudocode):
    socket s1, s2 ... sn;
    char buf[256];
    s1 = socket()
    bind(s1, IP:port)
    listen(s1)
    while {
        select(s1, s2 ... sn)
        if (s1)
            sn = accept(s1)
        else
            while (ret != 0)
                ret = send(sn, buf)
    }
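The same calls can be exercised in real C within a single process. This is a minimal sketch, assuming a Linux host with a working loopback interface; the function name loopback_roundtrip is hypothetical, the select() loop of the server is left out for brevity, and error handling is reduced to returning -1.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* The "server" side sends msg to the "client" side over loopback TCP.
 * Returns the number of bytes the client received, or -1 on error. */
static ssize_t loopback_roundtrip(const char *msg, char *out, size_t outlen)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    ssize_t got = -1;
    int lsock, csock = -1, ssock = -1;

    lsock = socket(AF_INET, SOCK_STREAM, 0);          /* server: socket() */
    if (lsock < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                     /* let the kernel pick a port */

    if (bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lsock, 1) < 0 ||
        getsockname(lsock, (struct sockaddr *)&addr, &alen) < 0)
        goto out;

    csock = socket(AF_INET, SOCK_STREAM, 0);          /* client: socket() */
    if (csock < 0 ||
        connect(csock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    ssock = accept(lsock, NULL, NULL);                /* server: accept() */
    if (ssock < 0)
        goto out;

    if (send(ssock, msg, strlen(msg), 0) != (ssize_t)strlen(msg))
        goto out;
    close(ssock);            /* end of stream: client recv() returns 0 */
    ssock = -1;

    got = 0;
    while ((size_t)got < outlen) {                    /* client: recv() */
        ssize_t n = recv(csock, out + got, outlen - (size_t)got, 0);

        if (n < 0) {
            got = -1;
            break;
        }
        if (n == 0)
            break;
        got += n;
    }
out:
    if (ssock >= 0)
        close(ssock);
    if (csock >= 0)
        close(csock);
    close(lsock);
    return got;
}
```

The client loop mirrors the pseudocode: it keeps calling recv() until the call returns 0, which happens once the server has closed its end of the connection.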
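Simple Client/Server Copies
[Figure: on the server, the user space application calls ret = send(s, buf) and the kernel copies the buffer from user space (copy from user) for Tx. On the client, the kernel copies the received data to user space (copy to user) when the application calls ret = recv(s, buf).]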
BSD Sockets Interface Properties
- Originally developed by UC Berkeley research at the dawn of time
- Used by 90% of network oriented programs
- Context switch for every Rx/Tx
- Buffer copied from/to user space to/from kernel
- Standard interface across operating systems
- Simple, well understood by programmers
Zero Copy
- Zero copy is built on an in-kernel buffer that the user has control over.
- The buffer is implemented as a set of reference counted pointers which the kernel copies around without actually copying the data.
- splice() moves data to/from the buffer from/to an arbitrary file descriptor.
- tee() duplicates data from one buffer to another.
- vmsplice() does the same as splice(), but instead of splicing from fd to fd as splice() does, it splices from a user address range into a pipe.
- It can be used anywhere a process needs to send something from one end to another but doesn't need to touch or even look at the data, just forward it.
Zero Copy Example 1: splice() *
[Figure: the HD controller copies the file data into a page cache page in kernel memory using DMA. splice() copies only a pointer to that page from the file to the socket buffer, where it becomes part of the frag list. The network chip then copies the data out using DMA. Nothing is copied to user space.]
* In reality you have to do two splice calls: one from the file to an intermediate pipe and one from the pipe to the socket buffers.
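The two-step pattern from the footnote can be sketched in user space as follows. This is a minimal sketch, assuming Linux with glibc; the function name splice_copy is hypothetical, and out_fd may be a socket, a file or any other descriptor that accepts spliced data. Whether pages are truly moved rather than copied depends on the kernel and the descriptors involved.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Move up to len bytes from in_fd to out_fd through an intermediate
 * pipe, without passing the data through a user space buffer.
 * Returns the number of bytes moved, or -1 on error. */
static ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;

    if (pipe(p) < 0)
        return -1;

    while ((size_t)total < len) {
        /* First splice: source descriptor into the pipe. */
        ssize_t n = splice(in_fd, NULL, p[1], NULL,
                           len - (size_t)total, SPLICE_F_MOVE);
        if (n < 0) {
            total = -1;
            break;
        }
        if (n == 0)
            break;                      /* end of input */

        /* Second splice: drain the pipe into the destination. */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL,
                               (size_t)n, SPLICE_F_MOVE);
            if (m <= 0) {
                total = -1;
                goto done;
            }
            n -= m;
            total += m;
        }
    }
done:
    close(p[0]);
    close(p[1]);
    return total;
}
```

To serve a file over TCP with this helper, in_fd would be the open file and out_fd the connected socket; the pipe in between is the in-kernel buffer that the user controls.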
Zero Copy Example 2: vmsplice() *
[Figure: the user space process writes the data to its own memory. vmsplice() copies only a pointer, taken from the process page tables, into the skb, where the page becomes part of the frag list. The network chip then copies the data from kernel memory using DMA.]
* In reality you have to do two calls: one vmsplice to an intermediate pipe and one splice from the pipe to the socket buffers.
Hardware Offloading
- Large receive offload is supported (in software).
- TCP / Large Segment Offload is supported (e.g. by the e1000 driver).
- There is no TCP Offload Engine support, because of:
  - Security updates
  - Point in time solution
  - Different network behavior
  - Hardware specific limits and resource based denial of service attacks
- See http://www.linux-foundation.org/en/net:toe
More Information
- Linux Foundation, Net:Kernel Flow: http://www.linuxfoundation.org/en/net:kernel_flow
- Zero Copy I: User Mode Perspective: http://www.linuxjournal.com/article/6345
- Understanding Linux Network Internals, O'Reilly Media
Use the Source, Luke!
Many resources and tricks on the Internet find you will, but solutions to all technical issues only in the Source lie.
Thanks to LucasArts
Copyrights and Trademarks
- Copyright 2004-2008 Codefidence Ltd.
- Tux Image Copyright: 1996 Larry Ewing
- Linux is a registered trademark of Linus Torvalds.
- All other trademarks are property of their respective owners.
Used and distributed under a