The Linux Network Subsystem

Covers Linux kernel version 2.6.25. Version 1.1.

Rights to copy
Attribution-ShareAlike 2.0
You are free:
to copy, distribute, display, and perform the work
to make derivative works
to make commercial use of the work
Under the following conditions:
Attribution. You must give the original author credit.
Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.
For any reuse or distribution, you must make clear to others the license terms of this work. Any of these conditions can be waived if you get permission from the copyright holder. Your fair use and other rights are in no way affected by the above.
License text: http://creativecommons.org/licenses/by-sa/2.0/legalcode
This kit contains work by the following authors:
Copyright 2004-2006 Michael Opdenacker, michael@free-electrons.com, http://www.free-electrons.com
Copyright 2003-2006 Oron Peled, oron@actcom.co.il, http://www.actcom.co.il/~oron
Copyright 2004-2008 Codefidence Ltd., info@codefidence.com, http://www.codefidence.com

What is Linux?
Linux is a kernel that implements the POSIX and Single Unix Specification standards and is developed as an Open Source project.
When one talks of installing Linux, one is referring to a Linux distribution: a combination of Linux and other programs and libraries that together form an operating system.
Linux runs on 24 main platforms and supports applications ranging from ccNUMA super clusters to cellular phones and microcontrollers.
Linux is 15 years old, but is based on the 40-year-old Unix design philosophy.

Layers in a Linux system
Diagram, from the bottom up: kernel and kernel modules; C library; system libraries; application libraries; user programs.

Kernel architecture
Diagram, from the top down:
User space: App1, App2, ..., C library.
Kernel space: system call interface; process management, memory management, filesystem support, device control, networking; CPU support code, CPU/MMU support code, filesystem types, storage drivers, character device drivers, network device drivers.
Hardware: CPU, RAM, storage.

Kernel Mode vs. User Mode
All modern CPUs support a dual mode of operation:
User mode, for regular tasks.
Supervisor (or privileged) mode, for the kernel.
The mode the CPU is in determines which instructions the CPU is willing to execute: sensitive instructions will not be executed when the CPU is in user mode.
The CPU mode is determined by one of the CPU registers, which stores the current ring level: 0 for supervisor mode, 3 for user mode; levels 1 and 2 are unused by Linux.

The System Call Interface
When a user space task needs to use a kernel service, it makes a system call.
The C library places the parameters and the system call number in registers and then issues a special trap instruction.
The trap atomically changes the ring level to supervisor mode and sets the instruction pointer to the kernel entry point.
The kernel finds the required system call via the system call table and executes it.
Returning from the system call does not require a trap, since code running in supervisor mode is allowed to lower the ring level directly.
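The path described above can be observed from user space. The following is a minimal sketch, assuming a Linux system with glibc: it requests the process ID once through the generic syscall() wrapper (which takes the system call number explicitly) and once through the ordinary getpid() wrapper. The function names raw_getpid and wrapped_getpid are made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our process ID through the generic syscall() wrapper.
 * The C library loads the system call number (SYS_getpid) into a register
 * and issues the trap for us. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* The same request through the ordinary C library wrapper. */
long wrapped_getpid(void)
{
    return (long)getpid();
}
```

Both functions should return the same value, since they reach the same kernel service; only the user-space wrapper differs.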

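Linux System Call Path
Diagram: the task makes a function call into glibc; glibc issues the trap; in the kernel, entry.S dispatches to sys_name(), which calls do_name(); control then returns to the task.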

Linux Networking Subsystem Overview
Diagram, from the top down:
Applications: App1, App2, App3.
Stack <> App boundary: socket layer.
Networking stack: UDP, TCP and ICMP over IP, plus the bridge.
Driver <> Stack boundary: stack driver interface.
Driver.
Driver <> Hardware boundary.

Network Device Driver Hardware Interface
Diagram: Tx and Rx descriptor rings in memory. Each descriptor points to a packet buffer and carries a state (Tx: Send, Free, SentOK, SendErr; Rx: Free, RcvOK, RcvErr, RecvCRC). The driver reaches the device through memory-mapped register access, the device signals the driver with interrupts, and packet data moves by DMA.
The driver allocates the ring buffers.
The driver resets the descriptors to their initial state.
The driver puts packets to be sent in Tx buffers.
The device puts received packets in Rx buffers.
The driver and the device update the descriptors to indicate state.
The device indicates Rx and end of Tx with an interrupt, unless interrupt mitigation techniques are applied.

Network Device Registration
Each network device is represented by a struct net_device.
These are allocated using:
struct net_device *alloc_netdev(size, mask, setup_func);
size: size of our private data part.
mask: a naming pattern (e.g. "eth%d").
setup_func: a function that sets up the rest of the net_device.
The device is then registered via a call to:
int register_netdev(struct net_device *dev);

Network Device Initialization
The net_device structure is initialized with numerous methods and flags by the setup function:
open: requests resources, registers interrupts, starts queues.
stop: deallocates resources, unregisters the IRQ, stops the queue.
get_stats: reports statistics.
set_multicast_list: configures the device for multicast.
hard_start_xmit: called by the stack to initiate Tx.
IFF_MULTICAST: the device supports multicast.
IFF_NOARP: the device does not support the ARP protocol.

Packet Representation
We need to manipulate packets as they move through the stack. This manipulation must be efficient and involves:
Adding protocol headers/trailers down the stack.
Removing protocol headers/trailers up the stack.
Chaining packets together.
Giving each protocol convenient access to its header fields.
To do all this the kernel uses the sk_buff structure.

Socket Buffers
The sk_buff structure represents a single packet.
This structure is passed through the protocol stack.
It holds pointers to the buffers containing the packet data.
It also holds many other kinds of information: data size, incoming device, priority, security, ...

struct sk_buff
next: Next buffer in list
prev: Previous buffer in list
sk: Socket we are owned by
tstamp: Time we arrived
dev: Device we arrived on/are leaving by
input_dev: Device we arrived on
h: Transport layer header
nh: Network layer header
mac: Link layer header
dst: Destination route cache entry
sp: Security path, used for xfrm
cb: Control buffer. Private data.
len: Length of actual data
data_len: Data length
mac_len: Length of link layer header
csum: Checksum
local_df: Allow local fragmentation flag
cloned: Head may be cloned (see refcnt)
nohdr: Payload reference only flag
pkt_type: Packet class
fclone: Clone status
ip_summed: Driver fed us an IP checksum
priority: Packet queuing priority
users: User count, see {datagram,tcp}.c
protocol: Packet protocol from driver
truesize: Buffer size
head: Head of buffer
data: Data head pointer
tail: Tail pointer
end: End pointer
destructor: Destruct function
nfmark: Netfilter hooks private data
nfct: Associated connection, if any
ipvs_property: skbuff is owned by ipvs
nfctinfo: Connection tracking info
nfct_reasm: Netfilter conntrack re-assembly pointer
nf_bridge: Saved data about a bridged frame
tc_index: Traffic control index
tc_verd: Traffic control verdict
dma_cookie: DMA operation cookie
secmark: Security marking for LSM

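Socket Buffer Diagram
Diagram: a struct sk_buff (fields len, ..., head, data, tail, end, ..., dev) points into a data buffer laid out as headroom, Ethernet header, IP header, TCP header, payload, padding. The buffer is followed by a struct skb_shared_info, which references the page fragments frag1, frag2, frag3.
Note: the network chip must support Scatter/Gather for frags to be used. Otherwise the kernel must copy the buffers before sending.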

Socket Buffer Operations
skb_put: add data to the end of a buffer.
skb_push: add data to the start of a buffer.
skb_pull: remove data from the start of a buffer.
skb_headroom: returns the number of free bytes at the buffer head.
skb_tailroom: returns the number of free bytes at the buffer end.
skb_reserve: adjust the headroom of an empty buffer.
skb_trim: remove data from the end of a buffer.

Operation Example: skb_put
unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
Adds data to a buffer:
skb: buffer to use.
len: amount of data to add.
This function extends the used data area of the buffer. If this would exceed the total buffer size the kernel will panic. A pointer to the first byte of the extra data is returned.
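The pointer arithmetic behind these operations can be sketched in user space. The following is a toy model only, not kernel code: struct toy_skb and the toy_* functions are invented for this illustration, and an assert() stands in for the kernel panic that the real skb_put triggers on overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model of the four sk_buff data pointers. */
struct toy_skb {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of the valid packet data */
    unsigned char *tail;  /* end of the valid packet data */
    unsigned char *end;   /* end of the allocated buffer */
    size_t len;           /* number of valid data bytes */
};

void toy_init(struct toy_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Free bytes before the data (cf. skb_headroom). */
size_t toy_headroom(const struct toy_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

/* Free bytes after the data (cf. skb_tailroom). */
size_t toy_tailroom(const struct toy_skb *skb)
{
    return (size_t)(skb->end - skb->tail);
}

/* Cf. skb_reserve: shift data and tail forward on an empty buffer. */
void toy_reserve(struct toy_skb *skb, size_t n)
{
    assert(skb->len == 0 && n <= toy_tailroom(skb));
    skb->data += n;
    skb->tail += n;
}

/* Cf. skb_put: extend the data area at the end, return the old tail. */
unsigned char *toy_put(struct toy_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;
    assert(n <= toy_tailroom(skb)); /* the real kernel would panic here */
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Cf. skb_push: extend the data area at the start (prepend a header). */
unsigned char *toy_push(struct toy_skb *skb, size_t n)
{
    assert(n <= toy_headroom(skb));
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Cf. skb_pull: drop bytes from the start (strip a header). */
unsigned char *toy_pull(struct toy_skb *skb, size_t n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

A typical sequence is reserve (make headroom), put (append payload), push (prepend a header on the way down the stack), and pull (strip that header on the way up). No packet bytes move in any of these steps; only the pointers change.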

Socket Buffer Alignment
CPUs often take a performance hit when accessing unaligned memory locations.
Since an Ethernet header is 14 bytes, network drivers often end up with the IP header at an unaligned offset.
The IP header can be aligned by shifting the start of the packet by 2 bytes. Drivers should do this with:
skb_reserve(skb, NET_IP_ALIGN);
The downside is that the DMA is now unaligned. On some architectures the cost of an unaligned DMA outweighs the gains, so NET_IP_ALIGN is set on a per-architecture basis.
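The arithmetic behind the 2-byte shift can be checked directly. This is a small sketch, assuming the buffer itself starts on an aligned address; TOY_ETH_HLEN, ip_header_offset and is_aligned are names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ETH_HLEN 14  /* Ethernet header length in bytes */

/* Offset of the IP header from the start of the buffer when the frame
 * is stored after `pad` reserved bytes. */
size_t ip_header_offset(size_t pad)
{
    return pad + TOY_ETH_HLEN;
}

/* Non-zero if `offset` is a multiple of `boundary`. */
int is_aligned(size_t offset, size_t boundary)
{
    return offset % boundary == 0;
}
```

With no padding the IP header starts at offset 14, which is not a multiple of 4. With 2 bytes reserved it starts at offset 16, which is aligned on 4-byte and 16-byte boundaries.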

Socket Buffer Padding
The networking layer reserves some headroom in skb data (NET_SKB_PAD).
This is used to avoid having to reallocate skb data when the header has to grow. In the default case, if the header has to grow 16 bytes or less, we avoid the reallocation.
Unfortunately, this headroom changes the DMA alignment of the resulting network packet. As with NET_IP_ALIGN, this unaligned DMA is expensive on some architectures.
Therefore an architecture can override this value, as long as at least 16 bytes of free headroom remain.

Socket Buffer Allocations
dev_alloc_skb: allocate an skbuff for Rx.
netdev_alloc_skb: allocate an skbuff for Rx, on a specific device.
Both allocate a new sk_buff and assign it a usage count of one.
The buffer has unspecified headroom built in. Users should allocate the headroom they think they need without accounting for the built-in space. The built-in space is used for optimizations.
NULL is returned if there is no free memory.
Although these functions allocate memory, they can be called from an interrupt.

sk_buff Allocation Example
Immediately after allocation, we should reserve the needed headroom:
    struct sk_buff *skb;

    skb = dev_alloc_skb(1500);
    if (unlikely(!skb))
        break;
    /* Mark as being used by this device */
    skb->dev = dev;
    /* Align IP on 16 byte boundaries */
    skb_reserve(skb, NET_IP_ALIGN);

Softnet
The network stack is implemented as a pair of softirqs to parallelize packet handling on SMP machines:
NET_TX_SOFTIRQ: feeds packets from the network stack to the driver.
NET_RX_SOFTIRQ: feeds packets from the driver to the network stack.
Like any other softirq, these are called on return from interrupt or via the low-priority ksoftirqd kernel thread.
Transmit/receive queues are stored in the per-CPU softnet_data.

Linux Contexts
Diagram:
Interrupt context: interrupt handlers; SoftIRQs (high-priority tasklets, net stack, timers, ..., regular tasklets).
User context: user space processes and threads; kernel threads.
The network interface device driver sits in kernel space.

Packet Reception
The driver allocates an skb and sets up a descriptor in the ring buffers for the hardware.
The driver's Rx interrupt handler calls netif_rx(skb).
netif_rx deposits the sk_buff in the per-CPU input queue and marks the NET_RX_SOFTIRQ to run.
At softirq processing time, net_rx_action() is called by NET_RX_SOFTIRQ, which calls the driver's poll() method to feed the packet up.
Normally poll() is set to process_backlog() by net_dev_init().

Packet Rx Overview (diagram not reproduced)

Packet Transmission
Each network device defines a method:
int (*hard_start_xmit)(struct sk_buff *skb, struct net_device *dev);
This function is indirectly called from the NET_TX_SOFTIRQ.
Calls are serialized via the device transmit lock (dev->xmit_lock_owner records the CPU that holds it).
The driver manages the transmit queue when the interface goes up or down, or to signal back pressure, using the following functions:
void netif_start_queue(struct net_device *net);
void netif_stop_queue(struct net_device *net);
void netif_wake_queue(struct net_device *net);

Packet Tx Overview (diagram not reproduced)

NAPI (Network New API)
Provides interrupt mitigation.
Requirements:
A DMA ring buffer.
The ability to turn off receive interrupts or events.
It is used by defining a new method:
int (*poll)(struct net_device *dev, int *budget);
which is called by the network stack periodically when signaled by the driver to do so.

NAPI (cont.)
When a receive interrupt occurs, the driver:
Turns off receive interrupts.
Calls netif_rx_schedule(dev) to get the stack to start calling its poll method.
The poll method:
Scans the receive ring buffers, feeding packets to the stack via netif_receive_skb(skb).
If the work finished within the budget parameter, re-enables interrupts and calls netif_rx_complete(dev).
Otherwise, the stack will call the poll method again.
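The control flow of the poll method can be sketched as a user-space simulation. This is not driver code: struct toy_nic, toy_rx_interrupt and toy_poll are hypothetical names, the "ring" is just a counter, and the return convention (0 when the ring is drained, 1 when more work remains) is an assumption of this sketch.

```c
#include <assert.h>

/* Toy stand-in for a NIC with a receive ring. */
struct toy_nic {
    int rx_pending;   /* packets waiting in the Rx ring */
    int irq_enabled;  /* 1 while receive interrupts are on */
    int delivered;    /* packets handed to the "stack" so far */
};

/* Receive interrupt: mask further Rx interrupts; the stack is then
 * expected to start calling toy_poll(). */
void toy_rx_interrupt(struct toy_nic *nic)
{
    nic->irq_enabled = 0;
}

/* Poll: deliver packets until the ring is empty or the budget is used up.
 * Returns 0 if the ring was drained (interrupts are re-enabled),
 * 1 if packets remain and the caller should poll again. */
int toy_poll(struct toy_nic *nic, int *budget)
{
    while (nic->rx_pending > 0 && *budget > 0) {
        nic->rx_pending--;
        nic->delivered++;   /* stands in for netif_receive_skb(skb) */
        (*budget)--;
    }
    if (nic->rx_pending == 0) {
        nic->irq_enabled = 1;
        return 0;
    }
    return 1;
}
```

Under load the ring rarely drains within one budget, so interrupts stay masked and packets are picked up by polling alone; when traffic subsides the device falls back to interrupts.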

Routing
After the socket buffer is delivered to a protocol handler, the handler may decide to route the packet.
The default routing uses normal destination-based routing with a single table and a FIB destination cache.
For each packet, the routing destination is looked up in the FIB cache. If found, the packet is sent to that interface's driver.
Otherwise a more costly routing decision based on rules occurs and the result is stored in the FIB.

What is Netfilter?
Netfilter is a framework for packet mangling.
Each protocol defines "hooks" (IPv4 defines 5), which are well-defined points in a packet's traversal of that protocol stack.
At each of these points, the protocol will call the netfilter framework with the packet and the hook number.
Parts of the kernel can register to listen to the different hooks for each protocol.
When a packet is passed to the netfilter framework, it will call all registered callbacks for that hook and protocol.

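Netfilter Architecture
Diagram: Ingress -> Pre Routing -> Route -> Forward -> Post Routing -> Egress. Packets routed to this host go from Route to Local In and on to local sockets; packets from local sockets enter at Local Out, are routed, and leave through Post Routing.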

Netfilter Hook
Kernel code can register a callback function to be called when a packet arrives at each hook. The callback is free to manipulate the packet.
The callback can then tell netfilter to do one of five things:
NF_ACCEPT: continue traversal as normal.
NF_DROP: drop the packet; don't continue traversal.
NF_STOLEN: I've taken over the packet; stop traversal.
NF_QUEUE: queue the packet (usually for userspace handling).
NF_REPEAT: call this hook again.
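The dispatch logic can be illustrated with a toy user-space model. None of the names below are the kernel's: toy_verdict, toy_packet, toy_run_hooks and the two callbacks are invented for this sketch, and only three of the five verdicts are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Toy verdicts, loosely mirroring NF_DROP, NF_ACCEPT and NF_STOLEN. */
enum toy_verdict { TOY_DROP, TOY_ACCEPT, TOY_STOLEN };

struct toy_packet {
    unsigned short dport;  /* destination port */
    int mark;              /* set by a mangling callback */
};

typedef enum toy_verdict (*toy_hook_fn)(struct toy_packet *pkt);

/* Call every registered callback for one hook, in order. Traversal
 * continues only while callbacks answer TOY_ACCEPT. */
enum toy_verdict toy_run_hooks(const toy_hook_fn *hooks, size_t n,
                               struct toy_packet *pkt)
{
    for (size_t i = 0; i < n; i++) {
        enum toy_verdict v = hooks[i](pkt);
        if (v != TOY_ACCEPT)
            return v;
    }
    return TOY_ACCEPT;
}

/* A mangling callback: modifies the packet and lets it continue. */
enum toy_verdict toy_mark_hook(struct toy_packet *pkt)
{
    pkt->mark = 1;
    return TOY_ACCEPT;
}

/* A filtering callback: drops packets addressed to port 23. */
enum toy_verdict toy_block_telnet(struct toy_packet *pkt)
{
    return pkt->dport == 23 ? TOY_DROP : TOY_ACCEPT;
}
```

Registering both callbacks on the same hook means every packet is marked first and then filtered; a packet to port 23 is dropped and never reaches later callbacks or the rest of the stack.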

IP Tables
A packet selection system called IP Tables has been built over the netfilter framework.
It is a direct descendant of ipchains (that came from ipfwadm, that came from BSD's ipfw IIRC), with extensibility.
Kernel modules can register a new table, and ask for a packet to traverse a given table.
This packet selection method is used for packet filtering (the 'filter' table), Network Address Translation (the 'nat' table) and general pre-route packet mangling (the 'mangle' table).

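IP Tables and Netfilter Hooks
Diagram: the netfilter hook layout (Ingress, Pre Routing, Route, Forward, Post Routing, Egress, Local In, Local Out, local sockets) annotated with the subsystems attached to each hook:
Pre Routing: Conntrack, Mangle, Destination NAT.
Forward: Mangle, Filter.
Post Routing: Conntrack, Mangle, Source NAT.
Local In: Filter, Conntrack, Mangle.
Local Out: Conntrack, Mangle, Destination NAT, Filter.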

BSD Sockets Interface
User space network interface:
socket() / bind() / accept() / listen(): initialization, addressing and handshaking.
select() / poll() / epoll(): waiting for events.
send() / recv(): stream-oriented (e.g. TCP) Rx / Tx.
sendto() / recvfrom(): datagram-oriented (e.g. UDP) Rx / Tx.

Simple Client/Server
Client (pseudocode):
    socket s;
    char buf[256];
    s = socket()
    connect(s, IP:port)
    while (ret != 0)
        ret = recv(s, buf)
Server (pseudocode):
    socket s1, s2 ... sn;
    char buf[256];
    s1 = socket()
    bind(s1, IP:port)
    listen(s1)
    while {
        select(s1, s2 ... sn)
        if (s1)
            sn = accept(s1)
        else
            while (ret != 0)
                ret = send(sn, buf)
    }
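The same call sequence can be exercised in a single process over the loopback interface. This is a minimal sketch, assuming a Linux host with the loopback interface up; loopback_transfer is a hypothetical helper, the port is chosen by the kernel, and select() is left out because only one connection is handled.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send `len` bytes of `msg` from a server socket to a client socket over
 * loopback TCP. Returns the number of bytes the client received, or -1. */
ssize_t loopback_transfer(const char *msg, size_t len, char *out, size_t outlen)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    ssize_t got = -1;
    int lsock = -1, csock = -1, ssock = -1;

    /* Server side: socket(), bind(), listen(). */
    lsock = socket(AF_INET, SOCK_STREAM, 0);
    if (lsock < 0)
        goto out;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  /* let the kernel pick a free port */
    if (bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (listen(lsock, 1) < 0)
        goto out;
    if (getsockname(lsock, (struct sockaddr *)&addr, &alen) < 0)
        goto out;

    /* Client side: socket(), connect(). */
    csock = socket(AF_INET, SOCK_STREAM, 0);
    if (csock < 0)
        goto out;
    if (connect(csock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* Server side: accept() the connection, send(), then close. */
    ssock = accept(lsock, NULL, NULL);
    if (ssock < 0)
        goto out;
    if (send(ssock, msg, len, 0) != (ssize_t)len)
        goto out;
    close(ssock);
    ssock = -1;

    /* Client side: recv() until the peer's close makes it return 0. */
    got = 0;
    while ((size_t)got < outlen) {
        ssize_t n = recv(csock, out + got, outlen - (size_t)got, 0);
        if (n < 0) {
            got = -1;
            break;
        }
        if (n == 0)
            break;
        got += n;
    }

out:
    if (ssock >= 0)
        close(ssock);
    if (csock >= 0)
        close(csock);
    if (lsock >= 0)
        close(lsock);
    return got;
}
```

Each send() and recv() here is a system call that copies the buffer between user space and the kernel.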

Simple Client/Server Copies
Diagram: the server application calls ret = send(s, buf) and the server's kernel copies the buffer from user space before Tx. The client's kernel receives the data (Rx) and copies it to user space when the client application calls ret = recv(s, buf).

BSD Sockets Interface Properties
Originally developed by UC Berkeley research at the dawn of time.
Used by 90% of network-oriented programs.
Context switch for every Rx/Tx.
Buffer copied between user space and the kernel on every Rx/Tx.
Standard interface across operating systems.
Simple, well understood by programmers.

Zero Copy
An in-kernel buffer that the user has control over.
The buffer is implemented as a set of reference-counted pointers which the kernel copies around without actually copying the data.
splice() moves data to/from the buffer from/to an arbitrary file descriptor.
tee() copies data from one buffer to another without consuming it.
vmsplice() does the same as splice(), but instead of splicing from fd to fd as splice() does, it splices from a user address range into the buffer.
Can be used anywhere a process needs to send something from one end to another but doesn't need to touch or even look at the data, just forward it.

Zero Copy Example 1: splice() *
Diagram: in user space only a pointer is copied. In kernel memory, the file holds a pointer to a page cache page containing the data, and the socket buffer holds a pointer to the same page as part of its frag list. In hardware, the HD controller fills the page and the network chip reads it, both by DMA.
* In reality you have to do two splice calls: one from the file to an intermediate pipe and one from the pipe to the socket buffers.
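The two-step splice can be sketched as follows. This is a minimal illustration, assuming Linux with glibc and blocking file descriptors; splice_file_to_socket is a hypothetical helper name, and error handling is reduced to returning -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Forward up to `len` bytes from file descriptor `in_fd` to socket
 * `sock_fd` through an intermediate pipe, without copying the data
 * through a user-space buffer. Returns bytes forwarded, or -1 on error. */
ssize_t splice_file_to_socket(int in_fd, int sock_fd, size_t len)
{
    int pipefd[2];
    ssize_t total = 0;

    if (pipe(pipefd) < 0)
        return -1;

    while ((size_t)total < len) {
        /* Step 1: file -> pipe. */
        ssize_t n = splice(in_fd, NULL, pipefd[1], NULL,
                           len - (size_t)total, SPLICE_F_MOVE);
        if (n < 0) {
            total = -1;
            break;
        }
        if (n == 0)
            break;  /* end of file */

        /* Step 2: pipe -> socket, until the pipe is drained. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(pipefd[0], NULL, sock_fd, NULL,
                               (size_t)left, SPLICE_F_MOVE);
            if (m <= 0) {
                total = -1;
                goto done;
            }
            left -= m;
        }
        total += n;
    }

done:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```

The pipe plays the role of the in-kernel buffer: the process never reads the file contents into its own memory, it only tells the kernel where the data should go.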

Zero Copy Example 2: vmsplice() *
Diagram: the user space process writes the data to its own memory; through the process page tables only a pointer is copied. In kernel memory, the skb holds a pointer to the data page as part of its frag list. In hardware, the network chip reads the data by DMA.
* In reality you have to do two calls: one vmsplice() to an intermediate pipe and one splice() from the pipe to the socket buffers.

Hardware Offloading
Large receive offload: supported (in software).
TCP / Large Segment Offload: supported (e.g. e1000 driver).
No TCP Offload Engine support. Concerns:
Security updates.
Point-in-time solution.
Different network behavior.
Hardware-specific limits and resource-based denial of service attacks.
http://www.linuxfoundation.org/en/net:toe

More Information
Linux Foundation, Net:Kernel Flow: http://www.linuxfoundation.org/en/net:kernel_flow
Zero Copy I: User Mode Perspective: http://www.linuxjournal.com/article/6345
Understanding Linux Network Internals, O'Reilly Media

Use the Source, Luke!
Many resources and tricks on the Internet find you will, but solutions to all technical issues only in the Source lie.
Thanks to LucasArts

Copyrights and Trademarks
Copyright 2004-2008 Codefidence Ltd.
Tux Image Copyright: 1996 Larry Ewing
Linux is a registered trademark of Linus Torvalds.
All other trademarks are property of their respective owners.
Used and distributed under a