Networking Subsystem in Linux. Manoj Naik IBM Almaden Research Center

Scope of the talk
- Linux TCP/IP networking layers
- Socket interfaces and structures
- Creating and using INET sockets
- Linux IP layer
- Socket calls in kernel
- Zerocopy networking

Linux TCP/IP Networking Layers
[Figure: layer diagram, top to bottom]
- User: Network Applications
- Kernel:
  - Socket Interface: BSD Sockets
  - INET Sockets
  - Protocol Layers: TCP, UDP, IP, ARP
  - Network Devices: PPP, SLIP, Ethernet

BSD Socket Interface
- Inter-process communication mechanism
- Address families: UNIX, INET, X25, ...
- Socket types: Stream, Datagram, Raw, Packet
- Processes in client-server model: IP address, ports

Linux Socket Structures
- protocols vector
  - address families and protocols
  - name (e.g. INET), init routine
- proto_ops data structures
  - registered protocol operations, e.g. sendmsg, recvmsg
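
The registration idea on this slide can be modeled in user space as a table of function pointers indexed by address family. The following is only a sketch: the names proto_ops_model, pops_model, register_family and find_family, the table size, and the family number 2 are all hypothetical and do not reproduce the kernel's actual definitions.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FAMILIES 8   /* hypothetical table size for this sketch */

/* Simplified stand-in for the operations a protocol family registers. */
struct proto_ops_model {
    int family;
    int (*sendmsg)(const char *buf, size_t len);
    int (*recvmsg)(char *buf, size_t len);
};

/* One slot per address family, in the spirit of a per-family ops vector. */
static const struct proto_ops_model *pops_model[MAX_FAMILIES];

/* Register the operations for a family; returns 0 on success, -1 otherwise. */
static int register_family(const struct proto_ops_model *ops)
{
    if (ops == NULL || ops->family < 0 || ops->family >= MAX_FAMILIES)
        return -1;
    if (pops_model[ops->family] != NULL)
        return -1;                      /* family already registered */
    pops_model[ops->family] = ops;
    return 0;
}

/* Look up the operations for a family; NULL if none is registered. */
static const struct proto_ops_model *find_family(int family)
{
    if (family < 0 || family >= MAX_FAMILIES)
        return NULL;
    return pops_model[family];
}

/* Toy operations used only to exercise the table. */
static int demo_sendmsg(const char *buf, size_t len)
{
    (void)buf;
    return (int)len;                    /* pretend everything was sent */
}

static int demo_recvmsg(char *buf, size_t len)
{
    if (len > 0)
        memset(buf, 0, len);
    return 0;                           /* pretend nothing was received */
}

/* Family number 2 is an arbitrary choice for the demo. */
static const struct proto_ops_model demo_inet_ops = { 2, demo_sendmsg, demo_recvmsg };
```

A caller would register demo_inet_ops once at start-up and then dispatch through find_family(2)->sendmsg(...), which mirrors how a generic socket layer can stay independent of the protocol underneath.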

INET Socket Layer
[Figure: data structures linking a file descriptor to an INET socket]
- files_struct: count, close_on_exec, open_fs, fd[0], fd[1], ..., fd[255]
- file: f_mode, f_pos, f_flags, f_count, f_owner, f_op, f_inode, f_version
- inode
- socket: type (SOCK_STREAM), ops (Address Family socket operations), data
- sock: type (SOCK_STREAM), protocol, socket

Using BSD Sockets in Linux
- Creating a BSD socket
- Binding an address to an INET socket
- Making a connection on an INET socket
- Listening on an INET socket
- Accepting connection requests

Creating a BSD Socket
int socket(family, type, protocol)
- Search pops for matching address family
  - sock->proto_ops = pops[family]
- Allocate a new socket data structure
  - VFS inode
- Call family-specific create routine
- Create and initialize new file structure
  - file->f_ops = sock_ops
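
From user space, all of the kernel work listed on this slide is triggered by a single socket() call. The sketch below creates a TCP/IPv4 socket and reads back the type the kernel recorded for it; the helper name create_and_query_type is hypothetical.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP/IPv4 socket and ask the kernel which type it recorded.
 * Returns the socket type (e.g. SOCK_STREAM) or -1 on failure. */
static int create_and_query_type(void)
{
    int type = -1;
    socklen_t len = sizeof(type);
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    if (fd < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
        type = -1;
    close(fd);          /* the descriptor is an ordinary file descriptor */
    return type;
}
```

The returned descriptor behaves like any other file descriptor, which is consistent with the slide's point that a file structure is created for the socket.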

Binding an Address to INET Socket
int bind(sockfd, sockaddr, socklen)
- Mostly handled by INET layer
  - sock->state = TCP_CLOSE
- sockaddr contains IP address, port number
  - sock->sk->saddr = <IP address>
- Routing of received packets
  - Hash tables in TCP/UDP for address lookups
  - Direct to correct socket/sock pair
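
A user-space view of bind() might look like the sketch below. It binds to the loopback address with port 0 so that the kernel picks a free port, then uses getsockname() to see which address and port were recorded. The helper name bind_loopback is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a TCP socket to 127.0.0.1, letting the kernel choose the port.
 * Returns the bound socket and stores the chosen port in *port_out,
 * or returns -1 on error. */
static int bind_loopback(unsigned short *port_out)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);               /* 0 = any free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port_out = ntohs(addr.sin_port);       /* port the kernel assigned */
    return fd;
}
```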

Making a Connection on INET Socket
int connect(sockfd, sockaddr, len)
- socket->state should be SS_UNCONNECTED
- UDP connect
  - setup addresses of remote application
  - cache route in ip_route_cache
- TCP connect
  - build TCP connect message
  - add sock to tcp_listening_hash
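
The UDP case can be observed from user space: connect() on a datagram socket sends nothing and only records the remote address. The sketch below connects a UDP socket to a loopback port and reads the recorded peer back with getpeername(). The helper names and the port number used in any call are assumptions for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect a UDP socket to 127.0.0.1:port. No packet is sent; the kernel
 * only records the peer. Returns the socket or -1 on error. */
static int udp_connect_loopback(unsigned short port)
{
    struct sockaddr_in peer;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    peer.sin_port = htons(port);
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Return the peer port recorded by connect(), or -1 if the socket has
 * no peer (for example, it was never connected). */
static int connected_peer_port(int fd)
{
    struct sockaddr_in peer;
    socklen_t len = sizeof(peer);

    if (getpeername(fd, (struct sockaddr *)&peer, &len) < 0)
        return -1;
    return ntohs(peer.sin_port);
}
```

After such a connect(), the application can use send() and recv() without supplying the destination address on every call.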

Listening on INET Socket
int listen(sockfd, backlog)
- Set sock->state = TCP_LISTEN
- Add sock to tcp_bound_hash
- Build a new sock for TCP
  - bottom-half of TCP
- Clone incoming sk_buff and queue it on receive_queue

Accepting Connection Requests
int accept(sockfd, sockaddr, len)
- TCP only
- Clone listening socket
- Add process to a wait queue and schedule
- On connect request
  - return sock to INET socket layer
  - Link sock to socket
  - return socket fd to application

Linux IP Layer
- Interface with network devices
- Socket buffers
- Receiving IP packets
- Sending IP packets
- Data fragmentation

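Network Device Structure
[Figure: struct device between the protocol layers and the physical device]
- Upper interface: dev_queue_xmit (deliver packets), netif_rx (receive & queue packets)
- struct device: methods and variables, initialization routine
- Driver interface: hard_start_xmit (deliver frames), dev_interrupt (collect rx frames)
- Physical Device and Media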

Socket Buffers
[Figure: sk_buff layout]
- sk_buff fields: next, prev, dev, head, data, tail, end, len, truesize
- Figure label: Packet to be transmitted

sk_buff operations
- skb_push: add data or headers to the start of data to be transmitted
- skb_pull: remove data or headers from the start of received data
- skb_put: add data to the end of packet to be transmitted
- skb_trim: remove data or trailer from the received packet
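
The four operations amount to pointer arithmetic on head, data, tail and end. The sketch below is a user-space model of that arithmetic, not kernel code: the struct skb_model and the skbm_* functions are hypothetical names, and skbm_reserve is included because headroom has to exist before a push can succeed.

```c
#include <stddef.h>
#include <stdlib.h>

/* User-space model: head..end is the allocation, data..tail the packet. */
struct skb_model {
    unsigned char *head, *data, *tail, *end;
    size_t len;
};

static int skbm_alloc(struct skb_model *skb, size_t size)
{
    skb->head = malloc(size);
    if (skb->head == NULL)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return 0;
}

static void skbm_free(struct skb_model *skb)
{
    free(skb->head);
    skb->head = NULL;
}

/* Leave n bytes of headroom for headers; only valid on an empty buffer. */
static int skbm_reserve(struct skb_model *skb, size_t n)
{
    if (skb->len != 0 || (size_t)(skb->end - skb->tail) < n)
        return -1;
    skb->data += n;
    skb->tail += n;
    return 0;
}

/* put: grow the packet at the tail; returns the start of the new area. */
static unsigned char *skbm_put(struct skb_model *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    if ((size_t)(skb->end - skb->tail) < n)
        return NULL;                    /* not enough tailroom */
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* push: grow the packet at the front, into the headroom. */
static unsigned char *skbm_push(struct skb_model *skb, size_t n)
{
    if ((size_t)(skb->data - skb->head) < n)
        return NULL;                    /* not enough headroom */
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* pull: strip n bytes (for example a parsed header) from the front. */
static unsigned char *skbm_pull(struct skb_model *skb, size_t n)
{
    if (skb->len < n)
        return NULL;
    skb->data += n;
    skb->len -= n;
    return skb->data;
}

/* trim: cut the packet down to new_len bytes, dropping the tail end. */
static void skbm_trim(struct skb_model *skb, size_t new_len)
{
    if (new_len < skb->len) {
        skb->len = new_len;
        skb->tail = skb->data + new_len;
    }
}
```

A transmit path would typically reserve headroom, put the payload, and then push each protocol header in front of it; a receive path would pull headers off as each layer consumes them.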

Receiving IP Packets
- Network device converts received data into sk_buffs
- sk_buffs are added to backlog queue
- Bottom-half handler is flagged to run
- Backlog queue is processed
- IP fragments (ipq) are put in ipqueue list

Sending IP Packets
- Determine packet route
  - for IP, use rtable
- Build sk_buff to contain data and protocol headers
- source IP address
- address of network device
- prebuilt hardware header (cached)

Socket Buffer Management

    void append_frame(char *buf, int len)
    {
        struct sk_buff *skb = alloc_skb(len, GFP_ATOMIC);

        if (skb == NULL)
            dropped++;
        else {
            skb_put(skb, len);
            memcpy(skb->data, buf, len);
            skb_append(&list, skb);
        }
    }

    void process_queue(void)
    {
        struct sk_buff *skb;

        while ((skb = skb_dequeue(&list)) != NULL) {
            process_data(skb);
            kfree_skb(skb, FREE_READ);
        }
    }

Higher Level Support Routines
Receive Data

    sk = find_socket(something);
    if (sock_queue_rcv_skb(sk, skb) == -1) {
        dropped++;
        kfree_skb(skb, FREE_READ);
        return;
    }

Transmit Data

    skb = sock_alloc_send_skb(sk, ...);
    if (skb == NULL)
        return -err;
    skb->sk = sk;
    skb_reserve(skb, headroom);
    skb_put(skb, len);
    memcpy(skb->data, data, len);
    protocol_do_something(skb);

Data Fragmentation
- Packet size for network device smaller than transmit data
  - fragment fields in protocol header
- MTU for device determined from routing tables
- Each fragment represented by sk_buff
- Received fragments (ipq) stored in ipqueue list
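
How a payload is split for a given MTU can be worked out with a few lines of arithmetic. The sketch below assumes an IPv4 header of 20 bytes without options and follows the IPv4 rule that every fragment except the last carries a multiple of 8 data bytes, because the header's fragment offset field counts 8-byte units. The names frag_plan and plan_fragments are hypothetical.

```c
#include <stddef.h>

#define IP_HDR_LEN 20   /* assumption: IPv4 header without options */

struct frag_plan {
    size_t offset;          /* byte offset of this fragment's data */
    size_t len;             /* data bytes carried by this fragment */
    int more_fragments;     /* 1 for every fragment except the last */
};

/* Plan the fragments for payload_len bytes of IP payload on a link with
 * the given MTU. Fills out[0..count-1] and returns count, or -1 if the
 * MTU is too small or more than max fragments would be needed. */
static int plan_fragments(size_t payload_len, size_t mtu,
                          struct frag_plan *out, int max)
{
    size_t per_frag, off = 0;
    int n = 0;

    if (mtu < IP_HDR_LEN + 8)
        return -1;
    /* Largest data size per fragment, rounded down to a multiple of 8. */
    per_frag = (mtu - IP_HDR_LEN) & ~(size_t)7;

    while (off < payload_len) {
        size_t chunk = payload_len - off;

        if (n == max)
            return -1;
        if (chunk > per_frag)
            chunk = per_frag;
        out[n].offset = off;
        out[n].len = chunk;
        out[n].more_fragments = (off + chunk < payload_len);
        off += chunk;
        n++;
    }
    return n;
}
```

For a 4000-byte payload and an MTU of 1500, this yields three fragments of 1480, 1480 and 1040 data bytes, with header offset fields of 0, 185 and 370 (byte offsets divided by 8).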

Address Resolution Protocol (ARP)
- Translate IP address to physical hardware address
- Header rebuilding routine for translation
  - ARP request: IP address broadcast
  - ARP response: hardware address from owner
- arp_table
  - last used, updated
  - IP address, h/w address & header
  - timer, retries, sk_buff queue

Socket System Calls (TCP)
[Figure: call sequence between server and client]

    Server                                    Client
    socket()
    bind()
    listen()
    accept()                                  socket()
      blocks until connection from client
            <-- connection establishment ---  connect()
    read()  <-- data (request) -------------  write()
    process request
    write() --- data (reply) -------------->  read()
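
The whole call sequence can be exercised in a single process over the loopback interface. In the sketch below the "server" simply echoes the request back as its reply. The function names loopback_exchange and read_exact, the 256-byte buffer and the backlog of 1 are assumptions made for the example. It relies on the kernel completing the TCP handshake against the listen backlog, so connect() can return before accept() is called.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read exactly len bytes, since TCP may deliver a message in pieces. */
static int read_exact(int fd, char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t r = read(fd, buf + done, len - done);
        if (r <= 0)
            return -1;
        done += (size_t)r;
    }
    return 0;
}

/* Run the server and client call sequences in one process. The server
 * side echoes the request. Returns the reply length or -1 on error. */
static long loopback_exchange(const char *req, size_t req_len,
                              char *reply, size_t cap)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    char buf[256];
    long result = -1;
    int lfd = -1, cfd = -1, sfd = -1;

    if (req_len > sizeof(buf) || req_len > cap)
        return -1;

    /* Server: socket(), bind(), listen() on 127.0.0.1, kernel-chosen port. */
    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        goto out;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0)
        goto out;

    /* Client: socket(), connect() to the address the server bound. */
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* Server: accept() returns a new socket for this connection. */
    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0)
        goto out;

    /* Client write() -> server read(), then server write() -> client read(). */
    if (write(cfd, req, req_len) != (ssize_t)req_len ||
        read_exact(sfd, buf, req_len) < 0 ||
        write(sfd, buf, req_len) != (ssize_t)req_len ||
        read_exact(cfd, reply, req_len) < 0)
        goto out;
    result = (long)req_len;
out:
    if (sfd >= 0) close(sfd);
    if (cfd >= 0) close(cfd);
    if (lfd >= 0) close(lfd);
    return result;
}
```

A real server would run accept() in its own process or thread and loop over connections; collapsing both sides into one function only keeps the ordering of the calls easy to follow.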

Invoking Socket Calls in Kernel
[Figure: the same server/client call sequence, annotated with the in-kernel equivalents on the server side]

    Server
      socket():  err = sock_create(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
      bind():    err = sock->ops->bind(sock, &sin, sizeof(sin));
      listen():  err = sock->ops->listen(sock, 48);
      accept():  err = sock->ops->accept(sock, newsock, O_NONBLOCK);
                 (blocks until connection from client)
      read():    rc = sock_recvmsg(sock, &msg, len, flags);
      process request
      write():   rc = sock_sendmsg(sock, &msg, len);

    Client
      socket()
      connect()  (connection establishment)
      write()    (data: request)
      read()     (data: reply)

Performance bottlenecks
- Per-packet and per-byte costs
- Data touching overheads
  - Copying data between system and application buffers
  - TCP checksumming: data integrity, per byte or packet
- Zerocopy approach

Zerocopy I/O
- Memory mapped files
  - access to static mappable objects
- Raw disk I/O
  - synchronous
  - Raw writes
    - user buffer accessed directly by disk driver
    - request blocks until end of data transfer
  - Raw reads
    - read buffer posted before disk I/O
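
The memory-mapped case can be illustrated with mmap(): the file's pages are mapped into the process and read in place, with no read() copy into a separate user buffer. The sketch below sums the bytes of a file through a read-only mapping; the helper name sum_mapped_file is hypothetical.

```c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sum all bytes of the file behind fd through a read-only mapping.
 * Stores the sum in *sum_out and returns 0, or returns -1 on error. */
static int sum_mapped_file(int fd, unsigned long *sum_out)
{
    struct stat st;
    unsigned char *p;
    unsigned long sum = 0;
    off_t i;

    if (fstat(fd, &st) < 0) {
        perror("fstat");
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero length */
        *sum_out = 0;
        return 0;
    }

    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    /* The loop touches the mapped pages directly. */
    for (i = 0; i < st.st_size; i++)
        sum += p[i];

    munmap(p, (size_t)st.st_size);
    *sum_out = sum;
    return 0;
}
```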

Issues with Zerocopy in TCP
Transmit side
- Retain user data for possible retransmission
- copy user data into a kernel buffer and put in outbound queue
- return asynchronously to user
- high throughput, buffer reuse

Issues with Zerocopy in TCP
Receive side
- Packets arrive at network interface asynchronously
- user read buffers not usually posted
- limited interface memory
- copy incoming data into a kernel buffer and put in inbound queue

Zerocopy Schemes
User accessible interface memory
- pre-mapping into user and kernel address spaces
- no copies
- complicated hardware support
- cache flushing
- intelligence in adapters to direct data
- substantial software changes
- special buffer management calls
- Limited interface memory
- memory leaks

Zerocopy Schemes
Kernel-network shared memory
- DMA or programmed I/O to move data between interface memory and user buffers
- No changes in existing applications
- Co-management of buffer pool between kernel and interface hardware
- Pinning of user pages for DMA
- Retransmit buffers in buffer pool

Zerocopy Schemes
User-kernel shared memory
- APIs with shared semantics between user and kernel address spaces
- DMA between shared memory and network interface
- Fast buffers (fbufs): per-process buffer pool pre-mapped in user and kernel
- Application compatibility problems
- Buffer pool fragmentation
- Targeted DMA transfer to correct memory pool

Zerocopy Schemes
User-kernel page remapping + COW
- DMA transfer between interface memory and kernel buffers
- Data "transfer" through page remapping
- edit MMU tables
- Copy-on-write (COW) on transmit side
- Expensive VM operations
- Operations on page boundaries

Hardware Checksumming
- Calculate data checksums during DMA transfers
- Software checksums can be expensive with cold caches
- Modern interface adapters (Gbit) perform checksumming in hardware
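
For reference, the checksum that TCP, UDP and the IP header use is the 16-bit one's-complement sum described in RFC 1071. The sketch below is a plain software version of it. It touches every byte of the data, which is the per-byte cost that hardware offload avoids. The function name inet_checksum is hypothetical, and the result is returned as a host-order value of the big-endian 16-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement Internet checksum over len bytes (RFC 1071 style).
 * Bytes are combined as big-endian 16-bit words; an odd trailing byte
 * is padded with a zero byte on the right. */
static uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t sum = 0;               /* wide accumulator, folded at the end */

    while (len > 1) {
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)(p[0] << 8);

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A receiver can run the same function over a header that includes its checksum field; a result of 0 means the header is consistent.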

Zerocopy in Linux
- Page, offset, length tuples
- Scatter-gather lists
- writepage, sendfile
- TCP socket options and flags
  - MSG_MORE (flag passed to send())
  - TCP_CORK (option set with setsockopt())
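
TCP_CORK is a Linux-specific option. While it is set, the kernel holds back partial segments so that several small writes can be coalesced; clearing it flushes what is queued. The sketch below shows one way to wrap it. The helper names set_cork, get_cork and send_corked are hypothetical, and short writes are treated as failures to keep the example small.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Set (on = 1) or clear (on = 0) TCP_CORK; returns 0 on success. */
static int set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Read the current TCP_CORK setting; returns 0, 1, or -1 on error. */
static int get_cork(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, &len) < 0)
        return -1;
    return on;
}

/* Send a header and a body as one corked burst on a connected TCP socket.
 * Returns 0 on success, -1 on any error or short write. */
static int send_corked(int fd, const void *hdr, size_t hlen,
                       const void *body, size_t blen)
{
    int rc = -1;

    if (set_cork(fd, 1) < 0)
        return -1;
    if (write(fd, hdr, hlen) == (ssize_t)hlen &&
        write(fd, body, blen) == (ssize_t)blen)
        rc = 0;
    if (set_cork(fd, 0) < 0)        /* uncorking flushes queued data */
        rc = -1;
    return rc;
}
```

MSG_MORE gives a similar hint on a per-call basis: send(fd, hdr, hlen, MSG_MORE) tells the kernel that more data will follow.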

Using sendfile()
Original code

    while ((c = read(filefd, buf, sizeof(buf))) > 0) {
        if ((d = write(sockfd, buf, c)) < 0)
            break;
        bytes += c;
    }

Modified code

    if (fstat(filefd, &statbuf) < 0)
        break;
    fsize = statbuf.st_size;
    bytes = sendfile(sockfd, filefd, &offset, fsize);

Using zerocopy TCP sendfile()
Original code

    rc = read(filefd, packet->data, size);
    packet->hdr = build_header(rc);
    rc = send(sockfd, packet, packet_size, 0);

Modified code

    /* Assume that file is locked and size won't change
       in the process of doing sendfile */
    fstat(filefd, &statbuf);
    packet->hdr = build_header(statbuf.st_size);
    rc = send(sockfd, packet->hdr, hdrsize, MSG_MORE);
    rc = sendfile(sockfd, filefd, &offset, statbuf.st_size);
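
A more complete version of the modified code might look like the sketch below. It assumes Linux, where sendfile() is declared in <sys/sendfile.h>, and a connected stream socket. The function name send_header_and_file is hypothetical, and the header is passed as an opaque byte buffer rather than the slide's packet structure. Because sendfile() may transfer fewer bytes than requested, the call is repeated until the whole file has gone out.

```c
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Send hdr followed by the entire file behind filefd on a stream socket.
 * Like the slide, this assumes the file size does not change meanwhile.
 * Returns the total number of bytes sent, or -1 on error. */
static long send_header_and_file(int sockfd, int filefd,
                                 const void *hdr, size_t hdrlen)
{
    struct stat st;
    off_t offset = 0;
    ssize_t rc;
    long total;

    if (fstat(filefd, &st) < 0) {
        perror("fstat");
        return -1;
    }

    /* MSG_MORE hints that more data follows, so the header need not
     * be pushed out in a segment of its own. */
    rc = send(sockfd, hdr, hdrlen, MSG_MORE);
    if (rc != (ssize_t)hdrlen)
        return -1;
    total = (long)rc;

    /* sendfile() advances offset itself; loop over partial transfers. */
    while (offset < st.st_size) {
        rc = sendfile(sockfd, filefd, &offset, (size_t)(st.st_size - offset));
        if (rc < 0)
            perror("sendfile");
        if (rc <= 0)
            return -1;
        total += (long)rc;
    }
    return total;
}
```

The file data never passes through a user-space buffer here; only the small header is copied from the application.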