L41 - Lecture 5: The Network Stack (1)


Dr Robert N. M. Watson
27 April 2015

Introduction - Reminder: where we left off in Lent Term

Long, long ago, but in a galaxy not so far away:
- Lecture 3: The Process Model (1)
- Lecture 4: The Process Model (2)
- Lab 1: I/O performance
- Lab 2: IPC buffer size and probe effect
- Lab 3: Micro-architectural effects of IPC

Explore several (implied, and it turns out, incorrect) hypotheses:
- Larger I/O and IPC buffer sizes amortize system-call overheads, improving application performance
- Micro-architecture is irrelevant
- The probe effect doesn't matter in real workloads

Network-stack goals

A key OS function: networking
- Communication between distributed computer systems
- Local-area networking (LANs) and wide-area networking (WANs)

A network stack provides:
- Sockets API and extensions
- Interoperable, feature-rich, high-performance protocol implementations (e.g., IPv4, IPv6, ICMP, UDP, TCP, SCTP, ...)
- Device drivers for Network Interface Cards (NICs)
- Monitoring and management interfaces (BPF, ioctl)
- Plethora of support libraries (e.g., DNS)

Dramatic changes over 30 years:
- 1980s: Early packet-switched networks, UDP+TCP/IP, Ethernet
- 1990s: Large-scale migration to IP; Ethernet VLANs
- 2000s: 1-Gigabit/s, then 10-Gigabit/s Ethernet; 802.11, GSM data
- 2010s: Large-scale deployment of IPv6; 40/100-Gigabit/s Ethernet

Vanishing technologies: UUCP, IPX/SPX, ATM, token ring, SLIP, ...

Network-stack goals - The Berkeley Sockets API

API calls: close(), read(), write(), ...; accept(), bind(), connect(), getsockopt(), listen(), recv(), select(), send(), setsockopt(), socket(), ...

Source: The Design and Implementation of the 4.3BSD Operating System (although the API appeared in 4.2BSD)

- Universal API for TCP/IP (POSIX, Windows, ...)
- Kernel-resident network stack serving userspace networking applications via system calls
- Reuse file-descriptor abstraction
- Same API for local and distributed IPC
- Simple, synchronous, copying semantics
- Blocking/non-blocking I/O, select()
- Multi-protocol (e.g., IPv4, IPv6, ISO, ...)
- TCP-focused but not TCP-specific
- Cross-protocol abstractions: protocol, socket address, stream, datagram, ...
- NB: a socket in the BSD API is not the same as a socket in the TCP RFC
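A minimal sketch of a few of the calls listed above, using Python's socket module as a thin wrapper over the Berkeley sockets API. It uses socketpair() so that it runs without a network, which also illustrates the "same API for local and distributed IPC" point. The helper name echo_once is hypothetical, and the single recv() is only safe here because the message is small.

```python
import select
import socket

def echo_once(message: bytes) -> bytes:
    """Send a small message over a local socket pair and read it back."""
    # socketpair() returns two connected sockets; each wraps a file descriptor.
    a, b = socket.socketpair()
    try:
        # Non-blocking mode: recv() raises instead of sleeping when no data is ready.
        b.setblocking(False)
        # send: the kernel copies the bytes into a socket buffer.
        a.sendall(message)
        # select() blocks until b is readable or the 1-second timeout expires.
        readable, _, _ = select.select([b], [], [], 1.0)
        if b not in readable:
            raise TimeoutError("no data arrived")
        # recv: the kernel copies the bytes out to user memory.
        return b.recv(len(message))
    finally:
        a.close()
        b.close()

print(echo_once(b"hello"))
```

The same send/recv/select calls would apply unchanged to a TCP socket created with socket() and connect(); only the setup differs.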

Network-stack goals - Early BSD network-stack design principles

- Framework for network research
- Object-oriented: multiple protocols, multiple socket types, one API
- Protocol-independent: streams vs. datagrams, sockets, socket buffers, socket addresses, network interfaces, routing table, packets
- Protocol-specific: connection lists, address/routing specialization, routing, transport protocol itself: encapsulation, decapsulation, etc.
- Fundamentally packet-oriented:
  - Packets and packet queueing as fundamental primitives
  - If there is a failure (overload, corruption), drop the packet
  - Work hard to maintain packet source ordering
  - Differentiate receive from deliver and send from transmit
- Heavy focus on TCP functionality and performance
- Middle-node (forwarding), not just edge-node (I/O), functionality
- High-performance packet capture (Berkeley Packet Filter (BPF))

Network-stack goals - FreeBSD network-stack design principles

All of the 1980s features and also ...
- Multi-processor scalability
- NIC offload features (checksums, TSO/LRO, full TCP)
- Multi-queue network cards with load balancing/flow direction
- Performance to 10s or 100s of Gigabit/s
- Dual-IPv4/IPv6
- Security/privacy: firewalls, IPsec, ...
- Flexible memory model integrates with VM for zero-copy
- Full network-stack virtualisation
- Userspace networking via netmap

Network-stack design - Memory flow in hardware

[Figure: memory-hierarchy diagram with labels CPU, L1 Cache, L2 Cache, Last-Level Cache (LLC), DRAM, PCI, NIC, Ethernet, DDIO]

- Key idea: follow the memory
- Historically, memory copying in stack avoided due to CPU cost
- Today, memory copying in stack avoided due to cache footprint
- Recent Intel CPUs push and pull DMA via the LLC ("DDIO")
- NB: if we differentiate send and transmit, is this a good idea?

Network-stack design - Memory flow in software

[Figure: data path between user process, kernel, and NIC, with buffers drawn from the network memory allocator.
Receive: NIC DMA receive -> NIC receive (alloc()) -> socket/protocol deliver (free()) -> copyout() -> recv().
Transmit: send() -> copyin() -> socket/protocol send (alloc()) -> NIC transmit (free()) -> NIC DMA transmit.]

- Socket API implies one copy to/from user memory
- Historically, zero-copy VM tricks for socket API ineffective
- Network buffers cycle through the slab allocator
  - Receive: allocate in NIC driver, free in socket layer
  - Transmit: allocate in socket layer, free in NIC driver
- DMA performs second copy; can affect cache/memory bandwidth
- NB: what if packet-buffer working set is larger than the cache?

Network-stack design - The mbuf abstraction

[Figure: struct mbuf layout (mbuf header, packet header, data, pad), with m_data and m_len delimiting the current data, held either in the mbuf itself or in external storage such as a VM/buffer-cache page. mbufs link into mbuf chains, and chains into mbuf packet queues. Users shown: socket buffer, netisr work queue, TCP reassembly queue, network interface queue.]

- mbuf chains represent in-flight packets, streams, etc.
- Also a unit of work allocation throughout the stack
- mbufs reference an in-mbuf or external buffer (e.g., VM page)
- Bi-modal packet size distribution; e.g., TCP ACKs vs. data
- Common operations: prepend, append, truncate at front or end
- Similar abstractions in other OSes, e.g., skbuff in Linux
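The prepend and truncate operations named above can be sketched with a toy model. This is an illustrative Python model, not the kernel's struct mbuf: the field names m_data, m_len and m_next follow the figure, while the class Mbuf, the helpers chain_bytes, prepend and trim_front, and the fixed 256-byte buffer are assumptions made for the example.

```python
MSIZE = 256  # per-mbuf buffer size assumed for this sketch

class Mbuf:
    """Toy mbuf: a fixed buffer plus an offset/length window onto it."""

    def __init__(self, data: bytes = b"", leading_space: int = 0):
        assert leading_space + len(data) <= MSIZE
        self.buf = bytearray(MSIZE)
        self.m_data = leading_space   # offset of the current data
        self.m_len = len(data)        # length of the current data
        self.m_next = None            # next mbuf in the chain
        self.buf[self.m_data:self.m_data + self.m_len] = data

    def data(self) -> bytes:
        return bytes(self.buf[self.m_data:self.m_data + self.m_len])

def chain_bytes(m) -> bytes:
    """Concatenate the current data of every mbuf in the chain."""
    out = b""
    while m is not None:
        out += m.data()
        m = m.m_next
    return out

def prepend(m, header: bytes):
    """Add a header in front, reusing leading space when there is enough."""
    if m.m_data >= len(header):
        m.m_data -= len(header)
        m.m_len += len(header)
        m.buf[m.m_data:m.m_data + len(header)] = header
        return m
    head = Mbuf(header)   # otherwise link a new mbuf at the front
    head.m_next = m
    return head

def trim_front(m, n: int):
    """Drop n bytes from the front by adjusting m_data/m_len, without copying."""
    cur = m
    while cur is not None and n > 0:
        take = min(n, cur.m_len)
        cur.m_data += take
        cur.m_len -= take
        n -= take
        cur = cur.m_next
    return m

payload = Mbuf(b"payload", leading_space=8)
packet = prepend(payload, b"TCPHDR__")   # fits in the leading space
packet = prepend(packet, b"IP")          # no space left: new mbuf in front
print(chain_bytes(packet))
```

Leaving leading space in the first mbuf is what lets each layer add its header on the way down without copying the payload; stripping a header on the way up is just an offset adjustment.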

Network-stack design - Local send/receive paths in the network stack

[Figure: functions on the receive path / send path, by layer]
- Application: recv() / send()
- System call layer: recv() / send()
- Socket layer: soreceive(), sbappend() / sosend(), sbappend()
- TCP layer: tcp_reass(), tcp_input() / tcp_send(), tcp_output()
- IP layer: ip_input() / ip_output()
- Link layer: ether_input() / ether_output()
- Device driver: em_intr(), em_start(), em_entr()
- NIC

Network-stack design - Forwarding path in the network stack

[Figure: functions on the forwarding path, by layer]
- IP layer: ip_input(), ip_forward(), ip_output()
- Link layer: ether_input() / ether_output()
- Device driver: em_intr(), em_start(), em_entr()
- NIC

Network-stack design - Work dispatch: input path

[Figure: input path from hardware through the kernel to userspace.
Stages: Device; Link layer + driver; IP; TCP + Socket; Socket; Application.
Processing steps, in order:
1. Receive, validate checksum
2. Interpret and strip link-layer header
3. Validate checksum, strip IP header
4. Validate checksum, strip TCP header
5. Look up socket
6. Reassemble segments, deliver to socket
7. Kernel copies out mbufs + clusters
8. Data stream to application
Threads, netisr dispatch: ithread -> netisr software ithread -> user thread.
Threads, direct dispatch: ithread -> user thread.]

- Deferred dispatch: ithread -> netisr thread -> user thread
- Now: direct dispatch: ithread -> user thread
  - Pros: reduced latency, better cache locality, drop overload early
  - Cons: reduced parallelism and work placement opportunities
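The difference between the two dispatch models can be sketched in a few lines. This is a user-level Python analogy, not kernel code: protocol_input stands in for the IP/TCP input processing, the worker thread named "netisr" stands in for the netisr thread, and all function and class names are hypothetical.

```python
import queue
import threading

def protocol_input(packet: bytes, delivered: list) -> None:
    """Stand-in for protocol input processing; records which thread ran it."""
    delivered.append((threading.current_thread().name, packet))

def direct_dispatch(packet: bytes, delivered: list) -> None:
    """Direct dispatch: run protocol processing to completion in the caller."""
    protocol_input(packet, delivered)

class DeferredDispatcher:
    """Deferred dispatch: hand packets to a worker thread via a bounded queue."""

    def __init__(self, delivered: list, maxlen: int = 4):
        self.delivered = delivered
        self.work = queue.Queue(maxsize=maxlen)
        worker = threading.Thread(target=self._run, name="netisr", daemon=True)
        worker.start()

    def _run(self) -> None:
        while True:
            packet = self.work.get()
            protocol_input(packet, self.delivered)
            self.work.task_done()

    def dispatch(self, packet: bytes) -> bool:
        try:
            self.work.put_nowait(packet)
            return True
        except queue.Full:
            return False   # overload: drop rather than block the caller

    def drain(self) -> None:
        self.work.join()   # wait until every queued packet has been processed

delivered = []
direct_dispatch(b"pkt-1", delivered)      # processed in the calling thread
deferred = DeferredDispatcher(delivered)
deferred.dispatch(b"pkt-2")               # processed later by the worker
deferred.drain()
print(delivered)
```

In the direct case the packet never leaves the calling thread, which is where the latency and cache-locality benefits come from; in the deferred case the queue adds a hand-off but lets the work run elsewhere.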

Network-stack design - Work dispatch: output path

[Figure: output path from userspace through the kernel to hardware.
Stages: Application; Socket; TCP; IP; Link layer + driver; Device.
Size labels: 256; 2k, 4k, 9k, 16k; MSS.
Processing steps, in order:
1. Data stream from application
2. Kernel copies in data to mbufs + clusters
3. TCP segmentation, header encapsulation, checksum
4. IP header encapsulation, checksum
5. Ethernet frame encapsulation, insert in descriptor ring
6. Checksum + transmit
Threads: user thread, ithread.]

- Fewer deferred dispatch opportunities implemented
- Gradual shift of work from software to hardware
  - Checksum calculation, segmentation, ...
- But no fundamental changes to output path
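Two of the output-path steps above, segmentation into MSS-sized pieces and checksum calculation, can be sketched as follows. The checksum is the 16-bit one's-complement Internet checksum described in RFC 1071; the function names are hypothetical, and the sketch checksums the payload only, whereas the real TCP checksum also covers the TCP header and a pseudo-header.

```python
def segment(stream: bytes, mss: int) -> list:
    """Split an application byte stream into segments of at most mss bytes."""
    return [stream[i:i + mss] for i in range(0, len(stream), mss)]

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit big-endian words, complemented."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

segments = segment(b"x" * 3000, 1460)
print([len(s) for s in segments])
print([hex(internet_checksum(s)) for s in segments])
```

These are exactly the per-packet loops that checksum offload and TCP segmentation offload (TSO) move from the CPU to the NIC.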

Network-stack design - Work dispatch: TOE input path

[Figure: TCP Offload Engine (TOE) input path from hardware through the kernel to userspace.
Stages: Device; Driver + socket layer; Application.
Processing steps, in order:
1. Receive, validate Ethernet, IP, TCP checksums
2. Interpret and strip link-layer header
3. Strip IP header
4. Strip TCP header
5. Look up and deliver to socket
6. Reassemble segments
7. Kernel copies out mbufs + clusters
8. Data stream to application
Threads: ithread, user thread.]

- Move majority of TCP processing through socket deliver to hardware
- Full TCP offload: kernel provides socket buffers and resource allocation
- Remainder, including state, retransmits, reassembly, etc., in NIC
- But: Two network stacks? Less flexible/updateable structure?
- Better with an explicit SW architecture, e.g., Microsoft Chimney?

Current research - netmap: a novel framework for fast packet I/O

Luigi Rizzo, USENIX ATC 2012 (best paper).

[Figure: netmap architecture. Labels: user process, kernel, operating system, protocol, NIC driver, hardware, NIC, NIC registers (head, tail), NIC ring (phys_addr, len), buffers, mbufs, v_addr, DMA receive, DMA transmit]

- Map NIC buffers directly into user process memory
- Zero copy to application
- Userspace network stack can be specialized to task (e.g., packet forwarding)
- System calls initiate DMA, block for NIC events
- Packets can be reinjected into normal stack
- Ships in FreeBSD, patch available for Linux
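The head and tail labels in the figure refer to indices into a ring of packet slots shared between a producer and a consumer. The following is a generic sketch of such a ring, not netmap's actual data structures or API; the class Ring and its method names are assumptions for illustration.

```python
class Ring:
    """Fixed-size circular ring of packet slots with head and tail indices."""

    def __init__(self, num_slots: int):
        self.slots = [None] * num_slots
        self.head = 0   # next slot the consumer will read
        self.tail = 0   # next slot the producer will fill

    def _next(self, index: int) -> int:
        return (index + 1) % len(self.slots)

    def produce(self, packet: bytes) -> bool:
        """Fill the slot at tail; fail when the ring is full."""
        if self._next(self.tail) == self.head:   # one slot is kept empty
            return False
        self.slots[self.tail] = packet
        self.tail = self._next(self.tail)
        return True

    def consume(self):
        """Return the packet at head, or None when the ring is empty."""
        if self.head == self.tail:
            return None
        packet = self.slots[self.head]
        self.head = self._next(self.head)
        return packet

ring = Ring(4)
for i in range(5):
    print(i, ring.produce(bytes([i])))
print(ring.consume())
```

Because each side only advances its own index, the producer and consumer can share the ring through mapped memory and exchange packets by updating indices rather than copying data.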

Current research - Network Stack Specialization for Performance

Ilias Marinos, Robert N. M. Watson, Mark Handley, SIGCOMM 2014.

[Figure: throughput (Gbps) and CPU utilization (%) vs. file size (KB, 4 to 1024) for Sandstorm, nginx + FreeBSD, and nginx + Linux]

- 30 years since current network-stack architecture principles developed
- Massive changes in compilers, architecture, micro-architecture, memory, buses, NICs
  - Optimising compilers
  - Cache-centered CPUs
  - Multiprocessing, NUMA
  - DMA, multiqueue
  - 10 Gigabit/s Ethernet
- Revisit fundamentals through clean-slate stack

Conclusion - Next time: Socket buffers and TCP

- McKusick, et al: Chapter 14 (Transport-Layer Protocols)
- The socket buffer abstraction
- A (very) little about the TCP implementation
- The TCP state machine
- TCP flow and congestion control
- The final two labs

[Figure: TCP Connection State Diagram (Figure 6, Transmission Control Protocol Functional Specification, September 1981). Transitions, written as state -> state: event / action, where x means no action:]
- CLOSED -> LISTEN: passive OPEN / create TCB
- CLOSED -> SYN SENT: active OPEN / create TCB, snd SYN
- LISTEN -> CLOSED: CLOSE / delete TCB
- LISTEN -> SYN RCVD: rcv SYN / snd SYN,ACK
- LISTEN -> SYN SENT: SEND / snd SYN
- SYN SENT -> CLOSED: CLOSE / delete TCB
- SYN SENT -> SYN RCVD: rcv SYN / snd ACK
- SYN SENT -> ESTAB: rcv SYN,ACK / snd ACK
- SYN RCVD -> ESTAB: rcv ACK of SYN / x
- SYN RCVD -> FIN WAIT-1: CLOSE / snd FIN
- ESTAB -> FIN WAIT-1: CLOSE / snd FIN
- ESTAB -> CLOSE WAIT: rcv FIN / snd ACK
- FIN WAIT-1 -> FINWAIT-2: rcv ACK of FIN / x
- FIN WAIT-1 -> CLOSING: rcv FIN / snd ACK
- FINWAIT-2 -> TIME WAIT: rcv FIN / snd ACK
- CLOSING -> TIME WAIT: rcv ACK of FIN / x
- CLOSE WAIT -> LAST-ACK: CLOSE / snd FIN
- LAST-ACK -> CLOSED: rcv ACK of FIN / x
- TIME WAIT -> CLOSED: Timeout=2MSL / delete TCB

Time permitting - Lab 1 - I/O - Buffer size vs. throughput

[Figure: median bandwidth (KBytes/s) reading a 16M file vs. buffer-size argument to read(2) in bytes (log scale, 10^0 to 10^7), for the unmodified benchmark and with bzero() of the buffer at start. Annotations: 32K L1, 32x4K TLB entries, 256K L2. Command: ./io-static -r -b <bytes> -t 16777216 /data/iofile]

Time permitting - Lab 1 - I/O - Static/dynamic linking vs. throughput

[Figure: Static vs. Dynamic Binary Performance by File Size, fixed buffer size (2M). Panels: median execution time (us, log scale) for dynamic and static binaries, and the dynamic/static ratio of medians; x-axis: file size in bytes (log scale, 10^0 to 10^8).]

- Ratio of 1.0: dynamic = static
- Ratio falls below 1.05 at roughly 8MB file, and 1.01 at roughly 32M