Improving the DragonFlyBSD Network Stack

Improving the DragonFlyBSD Network Stack

Yanmin Qiao
sephe@dragonflybsd.org
DragonFlyBSD project

Nginx performance and latency evaluation: HTTP/1.1, 1KB web object, 1 request/connection.

Setup: two Intel X540 clients, each connected to an Intel X550 port on the server.
- Server: 2x E5-2620v2, 32GB DDR3-1600, Hyperthreading enabled.
- nginx: installed from dports. nginx.conf: access log is disabled, 16K connections/worker.
- Clients: i7-3770, 16GB DDR3-1600, Hyperthreading enabled. 15K concurrent connections on each client.
- MSL on clients and server is changed to 20ms by: route change -net net -msl 20
- /boot/loader.conf: kern.ipc.nmbclusters=524288
- /etc/sysctl.conf: machdep.mwait.cx.idle=autodeep, kern.ipc.somaxconn=256, net.inet.ip.portrange.last=40000

IPv4 forwarding performance evaluation: 64B packets.

Setup: two pktgen+sink machines (Intel X540), each connected to an Intel X550 port on the forwarder.
- pktgen+sink: i7-3770, 16GB DDR3-1600, Hyperthreading enabled.
- forwarder: 2x E5-2620v2, 32GB DDR3-1600, Hyperthreading enabled.
- /boot/loader.conf: kern.ipc.nmbclusters=5242880
- /etc/sysctl.conf: machdep.mwait.cx.idle=autodeep
- Traffic generator: DragonFlyBSD's in-kernel packet generator, which has no problem generating 14.8Mpps.
- Traffic sink: ifconfig ix0 monitor
- Traffic: each pktgen targets 208 IPv4 addresses, which are mapped to one link-layer address on the forwarder.
- Performance counting: only packets sent by the forwarder are counted.

TOC
1. Use all available CPUs for network processing.
2. Direct input with polling(4).
3. Kqueue(2) accept queue length report.
4. Sending aggregation from the network stack.
5. Per CPU IPFW states.

Use all available CPUs for network processing

Issues with using only a power-of-2 count of CPUs:
- On a system with 24 CPUs, only 16 CPUs will be used. Bad for kernel-only applications: forwarding and bridging.
- Performance and latency of network-heavy userland applications cannot be fully optimized.
- 6, 10, 12 and 14 cores per CPU package are quite common these days.

Use all available CPUs for network processing (24 CPUs, 16 netisrs)

CONFIG.B: 32 nginx workers, no affinity.
CONFIG.A.16: 16 nginx workers, affinity.
Neither of them is optimal.

[Bar chart: CONFIG.B 215907.25 requests/s, CONFIG.A.16 191920.81 requests/s; latency values 33.11 ms and 32.04 ms (latency-avg/stdev/.99 plotted).]

Use all available CPUs for network processing

The host side is easy: (RSS_hash & 127) % ncpus.
How should the NIC RSS redirect table be set up? Simply like this: table[i] = i % nrings, for all 128 entries?

[Diagram: a 128-entry redirect table filled with ring indices 0-15 repeating.]

Use all available CPUs for network processing (NIC RSS redirect table)

Issue with table[i] = i % nrings, e.g. 24 CPUs, 16 RX rings, looking at the first 48 table entries:
- Host: (RSS_hash & 127) % ncpus, i.e. CPUs 0-23, 0-23.
- Current layout, table[i] = i % nrings: rings 0-15, 0-15, 0-15; 32 entries land on a ring whose CPU differs from the host's choice (32 mismatches).
- Patched grid-based layout (grid = ncpus/k, ncpus % k == 0): rings 0-15, 0-7, 0-15, 8-15; only 16 mismatches, and those are unavoidable because CPUs 16-23 own no RX rings.
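
To make the mismatch counts concrete, here is a small stand-alone C sketch (an illustration, not DragonFlyBSD code) that reproduces the two numbers above, under the assumption that RX ring r is serviced on CPU r; the grid-style table is hard-coded from the slide's diagram:

    #include <stdio.h>

    #define NCPUS   24
    #define NRINGS  16
    #define NENT    48  /* first 48 redirect-table entries, as on the slide */

    /*
     * Count entries whose ring lands on a different CPU than the host
     * picks with (RSS_hash & 127) % ncpus, assuming ring r runs on CPU r.
     */
    static int
    count_mismatches(const int table[])
    {
        int i, n = 0;

        for (i = 0; i < NENT; ++i) {
            if (table[i] != i % NCPUS)
                n++;
        }
        return n;
    }

    int
    main(void)
    {
        int naive[NENT], grid[NENT], i;

        /* Current layout: table[i] = i % nrings. */
        for (i = 0; i < NENT; ++i)
            naive[i] = i % NRINGS;

        /* Patched grid-style layout from the slide: 0-15, 0-7, 0-15, 8-15. */
        for (i = 0; i < 16; ++i) {
            grid[i] = i;            /* entries  0-15 -> rings 0-15 */
            grid[24 + i] = i;       /* entries 24-39 -> rings 0-15 */
        }
        for (i = 0; i < 8; ++i) {
            grid[16 + i] = i;       /* entries 16-23 -> rings 0-7  */
            grid[40 + i] = 8 + i;   /* entries 40-47 -> rings 8-15 */
        }

        /* Prints: naive: 32 mismatches, grid: 16 mismatches */
        printf("naive: %d mismatches, grid: %d mismatches\n",
            count_mismatches(naive), count_mismatches(grid));
        return 0;
    }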

Use all available CPUs for network processing (result: nginx, 24 CPUs)

CONFIG.A.24: 24 nginx workers, affinity.

[Bar chart: CONFIG.B 215907.25 requests/s, CONFIG.A.16 191920.81 requests/s, CONFIG.A.24 210658.8 requests/s; latency values 33.11 ms, 32.04 ms and 58.01 ms (latency-avg/stdev/.99 plotted).]

Use all available CPUs for network processing (result: forwarding)

[Bar chart: forwarding throughput in Mpps, fast forwarding vs. normal forwarding, 16 CPUs vs. 24 CPUs.]

Use all available CPUs for network processing: if_ringmap APIs for drivers

- if_ringmap_alloc(), if_ringmap_alloc2(), if_ringmap_free(): allocate/free a ringmap. if_ringmap_alloc2() is for NICs requiring a power-of-2 count of rings, e.g. jme(4).
- if_ringmap_align(): if N_txrings != N_rxrings, it makes sure that a TX ring and the respective RX ring are assigned to the same CPU. Required by bce(4) and bnx(4).
- if_ringmap_match(): if N_txrings != N_rxrings, it makes sure that the distribution of TX rings and RX rings is more even. Used by igb(4) and ix(4).
- if_ringmap_cpumap(), if_ringmap_rdrtable(): get ring-to-CPU binding information and the RSS redirect table.
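
A rough sketch of how a driver might tie these calls together at attach time follows. The argument lists below are assumptions reconstructed from the descriptions above rather than copied from the DragonFlyBSD headers, and the softc fields and EXAMPLE_* constants are hypothetical, so read it as a call-ordering illustration only:

    /*
     * Hypothetical attach-time usage.  Signatures are assumed from the
     * descriptions above, not from the real headers -- verify before use.
     */
    static int
    example_nic_attach(device_t dev)
    {
        struct example_softc *sc = device_get_softc(dev);
        int i;

        /* One ringmap each for the RX and TX rings. */
        sc->rx_rmap = if_ringmap_alloc(dev, EXAMPLE_NRXRING, EXAMPLE_NRXRING_MAX);
        sc->tx_rmap = if_ringmap_alloc(dev, EXAMPLE_NTXRING, EXAMPLE_NTXRING_MAX);

        /*
         * With different TX and RX ring counts, either pair TX/RX rings on
         * the same CPU (bce(4)/bnx(4) style, if_ringmap_align()) or spread
         * them evenly (igb(4)/ix(4) style, if_ringmap_match()).
         */
        if_ringmap_match(dev, sc->rx_rmap, sc->tx_rmap);

        /* Bind each RX ring to the CPU chosen by the ringmap. */
        for (i = 0; i < EXAMPLE_NRXRING; ++i)
            sc->rx_ring_cpu[i] = if_ringmap_cpumap(sc->rx_rmap, i);

        /* Fetch the RSS redirect table and program it into the NIC. */
        if_ringmap_rdrtable(sc->rx_rmap, sc->rdr_table, EXAMPLE_RDRTABLE_SIZE);

        return (0);
    }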

TOC
1. Use all available CPUs for network processing.
2. Direct input with polling(4).
3. Kqueue(2) accept queue length report.
4. Sending aggregation from the network stack.
5. Per CPU IPFW states.

Direct input with polling(4)

The current (requeuing) input path:
- Holds the RX ring serializer, mainly for stability, i.e. so the NIC can be brought down while there is heavy RX traffic.
- Has to requeue the mbufs to avoid a dangerous deadlock against the RX ring serializer, causing excessive L1 cache eviction and refetching.

[Flow diagram: netisrN holds the RX serializer, NIC_rxpoll(ifp1)/NIC_rxeof(ifp1) set up mbuf1..mbuf3 and queue them on the netisrN msgport, then release the serializer; only afterwards is each mbuf dequeued and run through ether_input_handler() -> ip_input() -> ip_forward() -> ip_output() -> ifp2/ifp3->if_output().]

Direct input with polling(4)

With direct input:
- The driver synchronizes with the netisrs at stop time.
- Holding the RX ring serializer is then no longer needed.
- Mbuf requeuing is not necessary.

[Flow diagram: netisrN runs NIC_rxpoll(ifp1)/NIC_rxeof(ifp1); each mbuf is set up and immediately pushed through ether_input_handler() -> ip_input() -> ip_forward() -> ip_output() -> ifp2/ifp3->if_output(), with no intermediate requeuing.]

Direct input with polling(4) (result: forwarding)

[Bar chart: forwarding throughput in Mpps, fast forwarding vs. normal forwarding, requeue vs. direct input.]

TOC
1. Use all available CPUs for network processing.
2. Direct input with polling(4).
3. Kqueue(2) accept queue length report.
4. Sending aggregation from the network stack.
5. Per CPU IPFW states.

Kqueue(2) accept queue length report

Applications, e.g. nginx, utilize the accept queue length reported by kqueue(2) like this:

    do {
        s = accept(ls, &addr, &addrlen);
        /* Setup accepted socket. */
    } while (--accept_queue_length);

The "setup accepted socket" part:
- Is time consuming.
- Destabilizes and increases request handling latency.

Kqueue(2) accept queue length report

The backlog of the listen(2) socket is the major factor affecting the kqueue(2) accept queue length report:
- A low listen(2) socket backlog, e.g. 32, causes excessive connection drops.
- A high listen(2) socket backlog, e.g. 256, destabilizes and increases request handling latency.

Kqueue(2) accept queue length report

The solution:
- Leave the listen(2) backlog high, e.g. 256.
- Limit the accept queue length reported by kqueue(2), e.g. to 32.
It is a simple one-liner change.
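
For reference, the sketch below is a minimal, hypothetical user-space example (not nginx code) of where the reported length comes from and how it is consumed: for a listening socket, the EVFILT_READ kevent's data field carries the accept queue length, and the application drains that many connections per wakeup, here additionally capped at 32 to mirror the kernel-side limit described above.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <unistd.h>
    #include <err.h>

    int
    main(void)
    {
        struct sockaddr_in sin;
        struct kevent kev;
        int ls, kq;

        if ((ls = socket(AF_INET, SOCK_STREAM, 0)) < 0)
            err(1, "socket");
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(8080);
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        if (bind(ls, (struct sockaddr *)&sin, sizeof(sin)) < 0)
            err(1, "bind");
        if (listen(ls, 256) < 0)            /* keep the listen(2) backlog high */
            err(1, "listen");

        kq = kqueue();
        EV_SET(&kev, ls, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &kev, 1, NULL, 0, NULL);

        for (;;) {
            long queue_len;

            if (kevent(kq, NULL, 0, &kev, 1, NULL) != 1)
                continue;
            queue_len = (long)kev.data;     /* reported accept queue length */
            if (queue_len > 32)             /* cap, mirroring the kernel-side limit */
                queue_len = 32;
            while (queue_len-- > 0) {
                int s = accept(ls, NULL, NULL);

                if (s < 0)
                    break;
                /* Setup accepted socket, then hand it off... */
                close(s);
            }
        }
        return 0;
    }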

Kqueue(2) accept queue length report (result: nginx, 24 CPUs)

FINAL: 24 nginx workers, affinity, accept queue length reported by kqueue(2) limited to 32.

[Bar chart: CONFIG.B 215907.25 requests/s, CONFIG.A.24 210658.8 requests/s, FINAL 217599.01 requests/s; latency values 33.11 ms, 58.01 ms and 32 ms (latency-avg/stdev/.99 plotted).]

TOC
1. Use all available CPUs for network processing.
2. Direct input with polling(4).
3. Kqueue(2) accept queue length report.
4. Sending aggregation from the network stack.
5. Per CPU IPFW states.

Sending aggregation from the network stack

- High performance drivers have already deployed mechanisms to reduce TX ring doorbell writes.
- The effectiveness of these driver optimizations depends on how the network stack feeds them packets.

Sending aggregation from the network stack

Preconditions:
- All packet sending runs in netisrs.
- Ordered rollups: protocols register rollup functions. The TCP ACK aggregation rollup has higher priority; the sending aggregation rollup has lower priority. Rollups are called when there are no more netmsgs, or after 32 netmsgs have been handled.

[Flow diagram: netisrN dequeues an ifp1 RX ring[n] polling msg, runs NIC_rxpoll()/NIC_rxeof(), forwards mbuf1 and mbuf2 through ip_forward()/ip_output() and queues them on ifp2.ifq[n] and ifp3.ifq[n] instead of starting TX immediately. Once the netisrN msgport has no more msgs, rollup1() flushes the pending aggregated TCP ACKs via tcp_output()/ip_output(), queueing mbuf3 and mbuf4 on the same IFQs, and then rollup2() flushes the pending mbufs on the IFQs: ifp2->if_start() puts mbuf1 and mbuf3 on ifp2 TX ring[n] and writes its doorbell register once, ifp3->if_start() does the same for mbuf2 and mbuf4, after which the netisr waits for more msgs.]
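
Stripped of the kernel plumbing, the core idea is "queue now, ring the doorbell once per batch". The following stand-alone sketch (hypothetical names, not DragonFlyBSD code) illustrates that pattern: packets are queued per interface queue while messages are being handled, and a flush routine, run after 32 messages or when the message loop goes idle, writes each TX doorbell once instead of once per packet.

    #include <stdio.h>

    #define NIFQ        2
    #define FLUSH_EVERY 32      /* rollups run after 32 netmsgs, or when idle */

    static int ifq_pending[NIFQ];       /* queued but not yet on the TX ring */
    static int doorbell_writes[NIFQ];

    static void
    queue_pkt(int ifq)
    {
        ifq_pending[ifq]++;             /* if_output(): queue only, no doorbell */
    }

    static void
    flush_ifqs(void)                    /* the sending-aggregation rollup */
    {
        for (int i = 0; i < NIFQ; ++i) {
            if (ifq_pending[i] == 0)
                continue;
            /* Put all pending packets on the TX ring, then ring once. */
            ifq_pending[i] = 0;
            doorbell_writes[i]++;
        }
    }

    int
    main(void)
    {
        /* Pretend to forward 100 packets, alternating output interfaces. */
        for (int pkt = 0; pkt < 100; ++pkt) {
            queue_pkt(pkt & 1);
            if ((pkt + 1) % FLUSH_EVERY == 0)
                flush_ifqs();           /* "32 netmsgs handled" case */
        }
        flush_ifqs();                   /* "no more netmsgs" case */

        printf("doorbell writes: ifq0=%d ifq1=%d (vs. 50 each without batching)\n",
            doorbell_writes[0], doorbell_writes[1]);
        return 0;
    }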

Sending aggregation from the network stack (result: forwarding)

[Bar chart: forwarding throughput in Mpps, fast forwarding vs. normal forwarding, aggregate 4 vs. aggregate 16.]

TOC
1. Use all available CPUs for network processing.
2. Direct input with polling(4).
3. Kqueue(2) accept queue length report.
4. Sending aggregation from the network stack.
5. Per CPU IPFW states.

Per CPU IPFW states

IPFW states were rwlocked, even though IPFW rules were already per CPU.

[Diagram: cpu0-cpu3 all point into one globally shared state table (state0-state3) protected by an rwlock, while each CPU has its own copy of the rules (rule0, rule1, default rule).]

Per CPU IPFW states

- UDP became RSS hash aware in 2014, a prerequisite for per CPU IPFW states.
- Rwlocked IPFW states work pretty well if the states are relatively stable, e.g. long-lived HTTP/1.1 connections.
- Rwlocked IPFW states suffer a lot from read/write contention when states are short-lived, e.g. short-lived HTTP/1.1 connections.

Per CPU IPFW states

[Diagram: a global state counter above per-CPU state tables; each of cpu0-cpu3 has its own state table with its own counter holding its own states (state0-state5 spread across the CPUs), in front of its per-CPU rules (rule0, rule1, default rule).]

Per CPU IPFW states

Loose state counter updates:
- The technique was implemented by Matthew Dillon for DragonFlyBSD's slab allocator; it avoids expensive atomic operations.
- Per CPU state counters are updated when states are installed/removed on the respective CPU.
- A per CPU state counter is added to the global state counter once it rises above a threshold.
- Per CPU state counters are re-merged into the global state counter if the global state counter goes beyond the limit.
- 5% overflow is allowed.
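
A minimal user-space sketch of the loose counter idea described above (hypothetical names and thresholds, not the actual IPFW code): each CPU counts locally and only folds its delta into the global counter when the delta crosses a threshold, so the per-packet path never touches the shared counter.

    #include <stdio.h>

    #define NCPU            4
    #define FOLD_THRESHOLD  128 /* fold the local delta into the global counter */

    static int pcpu_delta[NCPU];        /* each entry touched only by its own CPU */
    static int global_count;            /* loose total used by the limit check */

    static void
    state_installed(int cpu)
    {
        if (++pcpu_delta[cpu] >= FOLD_THRESHOLD) {
            /* Only here is the shared counter touched. */
            global_count += pcpu_delta[cpu];
            pcpu_delta[cpu] = 0;
        }
    }

    static int
    over_limit(int limit)
    {
        /*
         * global_count can lag the true total by up to NCPU * FOLD_THRESHOLD,
         * which is why the slides allow some (5%) overflow of the limit.
         */
        return global_count > limit;
    }

    int
    main(void)
    {
        for (int i = 0; i < 10000; ++i)
            state_installed(i % NCPU);
        printf("global=%d over_limit(9000)=%d\n", global_count, over_limit(9000));
        return 0;
    }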

Per CPU IPFW states (limit rule)

[Diagram: a globally shared, locked track count table (e.g. trackcnt0 with count=6) forms the track counter tier; below it, each of cpu0-cpu3 has its own track table in the track tier (track0-track2, each with a state list and a pointer to the shared count), its own state table (state0-state5 spread across the CPUs), and its per-CPU rules (rule0, rule1, default rule).]

Per CPU IPFW states (result: nginx)

On the server running nginx:

    ipfw add 1 check-state
    ipfw add allow tcp from any to me 80 setup keep-state

[Bar chart: no IPFW 210658.8 requests/s, per-CPU states 191626.58 requests/s, rwlocked states 43481.19 requests/s; latency values 58.01 ms, 64.74 ms and 153.76 ms respectively (latency-avg/stdev/.99 plotted).]

Per CPU IPFW states (redirect)

[Diagram/flow: a redirect rule creates a master/slave state pair on two CPUs. On netisrM, ip_input() -> ipfw_chk() matches the redirect rule, creates and installs the master state linked to redirect rule[m], translates and rehashes the mbuf, creates the slave state linked to redirect rule[n], tucks the slave state into the mbuf's netmsg and sends it to netisrN. On netisrN, ipfw_xlate_redispatch() re-enters ip_input() -> ipfw_chk(), installs the slave state and continues from the redirect rule's action into ip_forward() -> ip_output(); there ipfw_chk() matches the slave state, finds the translation direction does not match (so no translation is done), continues from the redirect rule's action and finally calls ifp->if_output().]

Performance Impact of Meltdown and Spectre

Performance Impact of Meltdown and Spectre

[Bar chart (nginx): base 224082.8 requests/s, meltdown 201859.11 requests/s, spectre 154723.94 requests/s, meltdown+spectre 145031.1 requests/s; latency-avg, latency-stdev and latency-.99 are also plotted and rise in the same order.]