Building High-Performance Networking on Linux. Shi Yingsheng (史应生), Technical Director, 北京酷锐达信息技术有限公司.


Building High-Performance Networking on Linux. Shi Yingsheng (史应生), Technical Director, 北京酷锐达信息技术有限公司. shiys@solutionware.com.cn

By default, Linux networking is not tuned for maximum performance but for reliability.

The trade-off: low latency vs. throughput vs. determinism.

Performance Goals
- Throughput: optimize for the best average. The default design criterion for most operating systems. "How much can you do at a time?"
- Low latency: optimize for the best minimum. Minimize execution times for certain paths. "What's the fastest we can push a packet out?"
- Determinism: optimize for the best (lowest) maximum. Fewest/lowest outliers. "What's the maximum time it will take?"

State-of-the-art NIC characteristics
- 56 Gigabits per second (the Unix network stack was designed for 10 Mbit/s)
- 4-7 Gigabytes per second (Unix: 1 MB/s)
- More than 8 million packets per second (Unix: ~1,000 packets per second)
- Less than a microsecond per packet for processing

Latency Factors

BIOS Settings for Low Latency

C-states: default is C7 on this configuration

C-states disabled (note the speed)

NPtcp latency vs. C-states: C7 vs. C0

Firmware tuning impact: a drastic picture

Different technologies to maximize performance
- IPoIB = kernel sockets layer using IP emulation over InfiniBand
- SDP = kernel sockets layer using a native InfiniBand connection
- IB = native InfiniBand connection (user space)
- Rsockets = socket emulation layer in user space
The performance comparison shows that kernel processing is detrimental to performance; bypass is essential.
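To make the user-space vs. kernel distinction concrete, here is a minimal sketch (shell; the preload library path and application name are assumptions, not from the slides) of running an unmodified sockets application over rsockets via librdmacm's preload library, compared with IPoIB, which simply appears as an ordinary kernel network interface:

    # rsockets: intercept the sockets API in user space and map it onto RDMA
    # (preload library path varies by distribution -- assumed here)
    LD_PRELOAD=/usr/lib64/rsocket/librspreload.so ./my_sockets_app

    # IPoIB: the ipoib kernel module exposes a normal interface (e.g. ib0),
    # so nothing changes for the application, but kernel overhead remains
    ip addr show ib0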

Different technologies to maximize performance (comparison)

Why bypass the kernel?
- The kernel is too slow and inefficient at high packet rates; problems already begin at 10G.
- Contemporary devices can map user-space memory and transfer data directly into user space, whereas the kernel must copy data between kernel buffers and user space.
- The kernel is continually regressing in terms of the overhead of basic system calls and operations; only new hardware compensates.

Sending a message via the sockets API

Kernel Bypass

Kernel Bypass

VNIC per CPU core (RSS)
- RX queue per CPU core
- TX queue per CPU core
- Complete CPU core separation
- Performance scales across CPUs
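On a stock Linux driver this per-core queue layout is usually set up with standard ethtool and procfs interfaces; a rough sketch follows, in which the interface name eth0, the queue count, the IRQ number, and the CPU mask are all illustrative assumptions:

    # Show, then set, the number of combined RX/TX channels (queues)
    ethtool -l eth0
    ethtool -L eth0 combined 8

    # Find the IRQ used by each queue ...
    grep eth0 /proc/interrupts

    # ... and pin a queue's IRQ to a single core (hex mask 4 = CPU 2)
    echo 4 > /proc/irq/112/smp_affinity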

Virtual NICs for application acceleration

Virtual NICs for VMs: the same model is used for SR-IOV. In this case the VM has direct access to its VNIC(s) via an SR-IOV virtual function (VF).
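For reference, creating VFs on an SR-IOV-capable adapter normally goes through standard sysfs and iproute2 interfaces; a minimal sketch, in which eth0, the VF count, and the MAC address are assumptions:

    # Create 4 virtual functions on the physical function
    echo 4 > /sys/class/net/eth0/device/sriov_numvfs

    # The VFs appear as additional PCI devices ...
    lspci | grep -i "virtual function"

    # ... and can be configured from the PF before being passed to a VM
    ip link set eth0 vf 0 mac 02:00:00:00:00:01
    ip link show eth0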

Acceleration middleware: just a library and a kernel module
- No application changes, no recompile, no kernel patches, no protocol changes
- Picks up the existing Linux network configuration: IP addresses and route table, bonding (aka teaming) / VLANs, multicast (IGMP), kernel settings such as socket buffer sizes

Offload: Solarflare OpenOnload
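As a hedged illustration of what "no application changes" means in practice with OpenOnload (assuming the onload package is installed; the application name is hypothetical):

    # Run an unmodified sockets application with OpenOnload acceleration
    onload ./market_data_app

    # The same mechanism expressed explicitly: preload the acceleration library
    LD_PRELOAD=libonload.so ./market_data_app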

ethtool: view and change Ethernet card settings; works mostly at the hardware level
- ethtool -S: hardware-level statistics (counters since boot; create scripts to calculate diffs)
- ethtool -c: interrupt coalescing settings
- ethtool -g: ring buffer information
- ethtool -k: hardware-assist (offload) information
- ethtool -i: driver information
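Concrete invocations of the flags above (eth0 is a placeholder interface name):

    ethtool -i eth0   # driver name, version, firmware
    ethtool -S eth0   # hardware counters since boot
    ethtool -g eth0   # current and maximum ring buffer sizes
    ethtool -k eth0   # offload / hardware-assist features
    ethtool -c eth0   # interrupt coalescing parameters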

sysctl: popular settings often mentioned in tuning guides
- net.ipv4.tcp_window_scaling: toggles window scaling
- net.ipv4.tcp_timestamps: toggles TCP timestamp support
- net.ipv4.tcp_sack: toggles SACK (Selective ACK) support
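These can be inspected and toggled at runtime with sysctl; the values shown below are only examples, not recommendations from the slides:

    sysctl net.ipv4.tcp_window_scaling     # read the current value
    sysctl -w net.ipv4.tcp_timestamps=1    # enable TCP timestamps
    sysctl -w net.ipv4.tcp_sack=1          # enable selective ACKs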

sysctl: core memory settings
- net.core.rmem_max / net.core.wmem_max: maximum size of the receive/transmit socket buffer
- net.core.rmem_default / net.core.wmem_default: default size of the receive/transmit socket buffer
- net.core.optmem_max: maximum amount of option memory buffers
- net.core.netdev_max_backlog: how many unprocessed RX packets may queue before the kernel starts to drop
These settings also affect UDP!
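A sketch of applying them; the sizes are illustrative placeholders rather than values from the deck, and in practice should be sized against the bandwidth-delay product of the link:

    # Raise the ceiling for per-socket receive/transmit buffers (bytes)
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216

    # Allow more received packets to queue before the kernel drops them
    sysctl -w net.core.netdev_max_backlog=30000

    # Persist by putting the same keys in /etc/sysctl.conf (or /etc/sysctl.d/)
    # and reloading:
    sysctl -p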

Effect of net.core.rmem_max on read throughput

Offload is the replacement of work that could be done in software with dedicated hardware. It overlaps with bypass, because direct device interaction replaces software actions in the kernel with the actions of a hardware device. Typical cases of hardware offload: DMA engines, GPUs, screen rendering, cryptography, TCP (TOE), FPGAs.

Network card hardware tuning
- Jumbo frames
- Transmission queue length
- Multiple streams
- Interrupt moderation
- RX/TX checksum offload
- TCP Segmentation Offload (TSO)
- Large Receive Offload (LRO)
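Typical knobs for the items above, using standard iproute2 and ethtool commands; the interface name and all values are assumptions for illustration only:

    ip link set dev eth0 mtu 9000           # jumbo frames
    ip link set dev eth0 txqueuelen 10000   # longer transmission queue
    ethtool -C eth0 rx-usecs 100            # interrupt moderation
    ethtool -K eth0 rx on tx on             # RX/TX checksum offload
    ethtool -K eth0 tso on lro on           # segmentation / large receive offload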

NUMA in network transfers
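The usual first step is to find which NUMA node the NIC is attached to and keep the application and its memory on that node; a minimal sketch, assuming numactl is installed and eth0 is the interface:

    # Which NUMA node is the NIC attached to? (-1 means no NUMA information)
    cat /sys/class/net/eth0/device/numa_node

    # Run the network application bound to that node's CPUs and memory
    numactl --cpunodebind=0 --membind=0 ./my_network_app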

Performance diagnostic tools

Performance diagnostic tools

Coming in 2013
- 100 Gbit/s networking
- >100 GB/s SSD / flash devices
- More cores in Intel processors
- GPUs already support thousands of hardware threads; newer models will offer more

Who are we? China's leading provider of complete Linux solutions.

Join us! hr@solutionware.com.cn

Q&A