The Network Stack (1)

Similar documents
L41 - Lecture 5: The Network Stack (1)

The Network Stack (2)

FreeBSD Network Stack Optimizations for Modern Hardware

L41: Lab 4 - The TCP State Machine

Introduction to Multithreading and Multiprocessing in the FreeBSD SMPng Network Stack

L41: Lab 4 - The TCP State Machine

L41: Lab 4 The TCP State Machine. Lecturelet 4 Dr Robert N. M. Watson and Dr Graeme Jenkinson 2 February 2018

Advanced Computer Networks. End Host Optimization

UDP and TCP. Introduction. So far we have studied some data link layer protocols such as PPP which are responsible for getting data

The Network Stack. Chapter Network stack functions 216 CHAPTER 21. THE NETWORK STACK

6.1 Internet Transport Layer Architecture 6.2 UDP (User Datagram Protocol) 6.3 TCP (Transmission Control Protocol) 6. Transport Layer 6-1

User Datagram Protocol

CSCI-GA Operating Systems. Networking. Hubertus Franke

TCP for OMNeT++ Roland Bless Mark Doll. Institute of Telematics University of Karlsruhe, Germany. Bless/Doll WSC

ECE 650 Systems Programming & Engineering. Spring 2018

Computer Network Programming. The Transport Layer. Dr. Sam Hsu Computer Science & Engineering Florida Atlantic University

Introduction to Networking. Operating Systems In Depth XXVII 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

NT1210 Introduction to Networking. Unit 10

UDP, TCP, IP multicast

Operating Systems. 17. Sockets. Paul Krzyzanowski. Rutgers University. Spring /6/ Paul Krzyzanowski

Chapter 10: I/O Subsystems (2)

Light & NOS. Dan Li Tsinghua University

Chapter 10: I/O Subsystems (2)

Network Implementation

CCNA 1 Chapter 7 v5.0 Exam Answers 2013

CSE/EE 461 Lecture 13 Connections and Fragmentation. TCP Connection Management

PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate

Network stack specialization for performance

Zero-Copy Socket Splicing

Internet Layers. Physical Layer. Application. Application. Transport. Transport. Network. Network. Network. Network. Link. Link. Link.

SMPng Network Stack Update

Fundamental Questions to Answer About Computer Networking, Jan 2009 Prof. Ying-Dar Lin,

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS

IO-Lite: A Unified I/O Buffering and Caching System

Software Routers: NetMap

Networking and Internetworking 1

19: Networking. Networking Hardware. Mark Handley

08:End-host Optimizations. Advanced Computer Networks

Application. Transport. Network. Link. Physical

Concurrent and distributed systems Lecture 8: Concurrent systems case study. Dr Robert N. M. Watson. Kernel concurrency

CSE 461 The Transport Layer

Introduction to TCP/IP networking

CSEP 561 Connections. David Wetherall

CSEP 561 Connections. David Wetherall

QuickSpecs. HP Z 10GbE Dual Port Module. Models

Our Small Quiz. Chapter 9: I/O Subsystems (2) Generic I/O functionality. The I/O subsystem. The I/O Subsystem.

Optimizing Performance: Intel Network Adapters User Guide

Networks and Operating Systems ( ) Chapter 10: I/O Subsystems (2)

CSE 461 Connections. David Wetherall

Much Faster Networking

Our Small Quiz. Chapter 10: I/O Subsystems (2) Generic I/O functionality. The I/O subsystem. The I/O Subsystem. The I/O Subsystem

NetBSD Kernel Topics:

Connections. Topics. Focus. Presentation Session. Application. Data Link. Transport. Physical. Network

Introduction to TCP/IP Offload Engine (TOE)

Transport Layer (TCP/UDP)

Lecture 20 Overview. Last Lecture. This Lecture. Next Lecture. Transport Control Protocol (1) Transport Control Protocol (2) Source: chapters 23, 24

NWEN 243. Networked Applications. Layer 4 TCP and UDP

Transport layer. UDP: User Datagram Protocol [RFC 768] Review principles: Instantiation in the Internet UDP TCP

Transport layer. Review principles: Instantiation in the Internet UDP TCP. Reliable data transfer Flow control Congestion control

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Network Layer (1) Networked Systems 3 Lecture 8

Lecture 8: Other IPC Mechanisms. CSC 469H1F Fall 2006 Angela Demke Brown

NETWORK PROGRAMMING. Instructor: Junaid Tariq, Lecturer, Department of Computer Science

Topics. Lecture 8: Other IPC Mechanisms. Socket IPC. Unix Communication

TPF 4.1 Communications - TCP/IP Enhancements

Demultiplexing on the ATM Adapter: Experiments withinternetprotocolsinuserspace

URDMA: RDMA VERBS OVER DPDK

What s an API? Do we need standardization?

Lecture 13: Transportation layer

6.9. Communicating to the Outside World: Cluster Networking

ADRIAN PERRIG & TORSTEN HOEFLER ( ) 10: I/O

CS4700/CS5700 Fundamentals of Computer Networks

CCNA R&S: Introduction to Networks. Chapter 7: The Transport Layer

9th Slide Set Computer Networks

Lecture 11: IP routing, IP protocols

High bandwidth, Long distance. Where is my throughput? Robin Tasker CCLRC, Daresbury Laboratory, UK

TSIN02 - Internetworking

Signaled Receiver Processing

Unix Network Programming

Lecture 11. Transport Layer (cont d) Transport Layer 1

TSIN02 - Internetworking

Transport: How Applications Communicate

Information Network 1 TCP 1/2. Youki Kadobayashi NAIST

440GX Application Note

Operating Systems and Networks. Network Lecture 8: Transport Layer. Adrian Perrig Network Security Group ETH Zürich

Operating Systems and Networks. Network Lecture 8: Transport Layer. Where we are in the Course. Recall. Transport Layer Services.

6WINDGate. White Paper. Packet Processing Software for Wireless Infrastructure

QUIZ: Longest Matching Prefix

CompSci 356: Computer Network Architectures. Lecture 8: Spanning Tree Algorithm and Basic Internetworking Ch & 3.2. Xiaowei Yang

TCP Basics : Computer Networking. Overview. What s Different From Link Layers? Introduction to TCP. TCP reliability Assigned reading

libvnf: building VNFs made easy

Transport protocols. Transport Layer 3-1

Mike Anderson. TCP/IP in Embedded Systems. CTO/Chief Scientist The PTR Group, Inc.

Chapter 10: Networking Advanced Operating Systems ( L)

RDMA and Hardware Support

Message Passing Models and Multicomputer distributed system LECTURE 7

TCP : Fundamentals of Computer Networks Bill Nace

Ed Warnicke, Cisco. Tomasz Zawadzki, Intel

Enabling Fast, Dynamic Network Processing with ClickOS

An FPGA-Based Optical IOH Architecture for Embedded System

Transcription:

The Network Stack (1) L41 Lecture 5 Dr Robert N. M. Watson 25 January 2017 Reminder: where we left off last term Long, long ago, but in a galaxy not so far away: Lecture 3: The Process Model (1) Lecture 4: The Process Model (2) Lab 2: IPC buffer size and the probe effect Lab 3: Micro-architectural effects of IPC Explored several implied (and rejected) hypotheses: Larger IO/IPC buffer sizes amortise system-call overhead A purely architectural (SW) review dominates The probe effect doesn t matter in real-world workloads 1

This time: Introduction to Network Stacks Rapid tour across hardware and software: Networking and the sockets API Network-stack design principles: 1980s vs. today Memory flow across hardware and software Network-stack construction and work flows Recent network-stack research Networking: A key OS function (1) Communication between computer systems Local-Area Networks (LANs) Wide-Area Networks (WANs) A network stack provides: Sockets API and extensions Interoperable, feature-rich, high-performance protocol implementations (e.g., IPv4, IPv6, ICMP, UDP, TCP, SCTP, ) Security functions (e.g., cryptographic tunneling, firewalls...) Device drivers for Network Interface Cards (s) Monitoring and management interfaces (BPF, ioctl) Plethora of support libraries (e.g., DNS) 2

Networking: A key OS function (2) Dramatic changes over 30 years: 1980s: Early packet-switched networks, UDP+TCP/IP, Ethernet 1990s: Large-scale migration to IP; Ethernet VLANs 2000s: 1-Gigabit, then 10-Gigabit Ethernet; 802.11; GSM data 2010s: Large-scale deployment of IPv6; 40/100-Gbps Ethernet... billions trillians of devices? Vanishing technologies UUCP, IPX/SPX, ATM, token ring, SLIP,... The Berkeley Sockets API (1983) close() read() write()... accept() bind() connect() getsockopt() listen() recv() select() send() setsockopt() socket()... The Design and Implementation of the 4.3BSD Operating System (but APIs/code first appeared in 4.2BSD) Now universal TCP/IP (POSIX, Windows) Kernel-resident network stack serves networking applications via system calls Reuses file-descriptor abstraction Same API for local and distributed IPC Simple, synchronous, copying semantics Blocking/non-blocking I/O, select() Multi-protocol (e.g., IPv4, IPv6, ISO, ) TCP-focused but not TCP-specific Cross-protocal abstractions and libraries Protocol-specific implementations Portable applications 3

BSD network-stack principles (1980s-1990s) Multi-protocol, packet-oriented network research framework: Object-oriented: multiple protocols, socket types, but one API Protocol-independent: streams vs. datagrams, sockets, socket buffers, socket addresses, network interfaces, routing table, packets Protocol-specific: connection lists, address/routing specialization, routing, transport protocol itself encapsulation, decapsulation, etc. Packet-oriented: Packets and packet queueing as fundamental primitives If there is a failure (overload, corruption), drop the packet Work hard to maintain packet source ordering Differentiate receive from deliver and send from transmit Heavy focus on TCP functionality and performance Middle-node (forwarding), not just edge-node (I/O), functionality High-performance packet capture: Berkeley Packet Filter (BPF) FreeBSD network-stack principles (1990s-2010s) All of the 1980s features and also Hardware: Multi-processor scalability offload features (checksums, TSO/LRO, full TCP) Multi-queue network cards with load balancing/flow direction Performance to 10s or 100s of Gigabit/s Wireless networking Protocols: Dual IPv4/IPv6 Security/privacy: firewalls, IPSec,... Software model: Flexible memory model integrates with VM for zero-copy Network-stack virtualisation Userspace networking via netmap 4

Memory flow in hardware CPU CPU Ethernet 32K, 3-4 cycles L1 Cache L1 Cache PCI 256K, 8-12 cycles 25M 32-40 cycles DRAM up to 256-290 cycles L2 Cache Last-Level Cache (LLC) DRAM DDIO Key idea: follow the memory Historically, memory copying avoided due to instruction count Today, memory copying avoided due to cache footprint Recent Intel CPUs push and pull DMA via the LLC ( DDIO ) If we differentiate send and transmit, is this a good idea? it depends on the latency between DMA and processing. Memory flow in software User process recv( ) copyout() send( ) copyin() Kernel Socket/protocol deliver receive free() alloc() network memory allocator alloc() free() Socket/protocol send transmit DMA receive DMA transmit Socket API implies one software-driven copy to/from user memory Historically, zero-copy VM tricks for socket API ineffective Network buffers cycle through the slab allocator Receive: allocate in driver, free in socket layer Transmit: allocate in socket layer, free in driver DMA performs second copy; can affect cache/memory bandwidth NB: what if packet-buffer working set is larger than the cache? 5

The mbuf abstraction socket buffer netisr work queue TCP reassembly queue network interface queue m_data m_len data current data struct mbuf mbuf header packet header data external storage pad m_data mbuf packet queue m_len VM/ buffercache page current data mbuf mbuf chain mbuf mbuf Unit of work allocation and distribution throughout the stack mbuf chains represent in-flight packets, streams, etc. Operations: alloc, free, prepend, append, truncate, enqueue, dequeue Internal or external data buffer (e.g., VM page) Reflects bi-modal packet-size distribution (e.g., TCP ACKs vs data) Similar structures in other OSes e.g., skbuff in Linux Send/receive paths in the network stack Application recv() send() System call layer recv() send() Socket layer soreceive() sbappend() sosend() sbappend() TCP layer tcp_reass() tcp_input() tcp_send() tcp_output() IP layer ip_input() ip_output() Link layer ether_input() ether_output() Device driver em_intr() em_start() em_entr() 6

Forwarding path in the network stack IP layer ip_forward() ip_input() ip_output() Link layer ether_input() ether_output() Device driver em_intr() em_start() em_entr() Work dispatch: input path Hardware Kernel Userspace Device Linker layer + driver IP TCP + Socket Socket Application Receive, validate checksum Interpret and strips link layer header Validate checksum, strip IP header Validate checksum, strip TCP header Look up socket Reassemble segments, deliver to socket Kernel copies out mbufs + clusters Data stream to application netisr dispatch ithread netisr software ithread user thread Direct dispatch ithread Deferred dispatch: ithread netisr thread user thread Direct dispatch: ithread user thread Pros: reduced latency, better cache locality, drop early on overload Cons: reduced parallelism and work placement opportunities user thread 7

Work dispatch: output path Userspace Kernel Hardware Application Socket TCP IP Link layer + driver Device 256 2k, 4k, 9k, 16k MSS MSS Data stream from application Kernel copies in data to mbufs + clusters TCP segmentation, header encapsulation, checksum IP header encapsulation, checksum Ethernet frame encapsulation, insert in descriptor ring Checksum + transmit user thread ithread Fewer deferred dispatch opportunities implemented (Deferred dispatch on device-driver handoff in new iflib KPIs) Gradual shift of work from software to hardware Checksum calculation, segmentation, Work dispatch: TOE input path Hardware Kernel Userspace Device Driver + socket layer Application Receive, validate ethernet, IP, TCP checksums Interpret and strips link layer header Strip IP header Strip TCP Look up header and deliver to socket Reassemble segments Kernel copies out mbufs + clusters Data stream to application ithread user thread Move majority of TCP processing through socket deliver to hardware Kernel provides socket buffers and resource allocation Remainder, including state, retransmissions, etc., in But: two network stacks? Less flexible/updateable structure? Better with an explicit HW/SW architecture e.g., Microsoft Chimney 8

Netmap: a novel framework for fast packet I/O Luigi Rizzo, USENIX ATC 2012 (best paper). registers head tail User process Kernel driver Hardware ring phys_addr len Protocol DMA receive Operating system Buffers mbufs v_addr DMA transmit Map buffers directly into user process memory Zero copy to/from application System calls initiate DMA, block for events Packets can be reinjected into normal stack Ships in FreeBSD; patch available for Linux Userspace network stack can be specialised to task (e.g., packet forwarding) Network stack specialisation for performance Ilias Marinos, Robert N. M. Watson, Mark Handley, SIGCOMM 2014. Throughput (Gbps) CPU utilization (%) 60 40 20 0 100 80 60 40 20 0 Sandstorm nginx + FreeBSD nginx + Linux 4 8 16 24 32 64 128 256 512 756 1024 File size (KB) Sandstorm nginx + FreeBSD nginx + Linux 4 8 16 24 32 64 128 256 512 756 1024 File size (KB) 30 years since the networkstack design developed Massive changes in architecture, microarchitecture, memory Optimising compilers Cache-centered CPUs Multiprocessing, NUMA DMA, multiqueue 10 Gigabit/s Ethernet Performance lost to generality throughout stack Revisit fundamentals through clean-slate stack Orders-of-magnitude performance gains 9

Next time: Socket buffers and TCP September 1981 Transmission Control Protocol Functional Specification ---------\ active OPEN +---------+ CLOSED \ ----------- +---------+<---------\ create TCB \ ^ snd SYN \ \ \ passive OPEN CLOSE \ ------------ ---------- \ \ create TCB delete TCB \ \ V \ \ +---------+ CLOSE \ LISTEN ---------- +---------+ delete TCB rcv SYN SEND ----------- ------- V snd SYN,ACK \ snd SYN +---------+ +---------+ / <----------------- ------------------> SYN rcv SYN SYN RCVD <----------------------------------------------- SENT snd ACK ------------------ ------------------- +---------+ rcv ACK of SYN / rcv SYN,ACK +---------+ \ -------------- ----------- x snd ACK V V +---------+ CLOSE ------- ESTAB snd FIN +---------+ CLOSE rcv FIN V ------- ------- +---------+ snd FIN snd ACK / \ +---------+ FIN <----------------- ------------------> CLOSE WAIT-1 ------------------ WAIT +---------+ rcv FIN +---------+ \ rcv ACK ------- CLOSE of FIN -------------- snd ACK ------- V x V snd FIN V +---------+ +---------+ +---------+ FINWAIT-2 CLOSING LAST-ACK +---------+ +---------+ +---------+ rcv ACK of FIN rcv ACK of FIN rcv FIN -------------- Timeout=2MSL -------------- ------- x V ------------ x V \ snd ACK +---------+delete TCB +---------+ ------------------------> TIME WAIT ------------------> CLOSED +---------+ +---------+ McKusick, et al: Chapter 14 (Transport-Layer Protocols) Transmission Control Protocol (TCP) TCP implementation Buffers and input processing Parallelism and performance DoS resistance The final two labs TCP Connection State Diagram Figure 6. 10