Improving DPDK Performance

The Data Plane Development Kit (DPDK) was pioneered by Intel as a way to boost the speed of packet processing on standard hardware. DPDK-enabled applications typically show four or more times better performance than their non-DPDK counterparts, thanks to the kernel bypass and full userspace processing provided by DPDK. So many commercial and open source applications support DPDK that it has become a de facto standard in the SDN and NFV world. Open vSwitch is a prominent example of a popular DPDK-enabled component of many NFV deployments. The Intel ONP Server Performance Test Report from 2015 shows that DPDK-enabled Open vSwitch can process 40 Gbps on a single core for packet lengths above a certain threshold. Even more raw throughput can be expected to be required in the near future, enabling higher-density NFV deployments and better CPU core utilization for communication-intensive tasks. This whitepaper explores the absolute performance limits of the DPDK API using Netcope FPGA Boards (NFB) and the Netcope Development Kit (NDK).

About NFB and NDK

Netcope FPGA Boards (NFB), FPGA-based programmable network interface cards, are a unique example of the symbiosis of state-of-the-art technologies fitting together in terms of achievable performance and throughput. The network link speed, the performance of the on-board network controller, the throughput of the PCI Express bus and the performance of the host system are all factors influencing the whole solution. Maximum attention was paid during the whole product design process to make every link of this chain as strong as possible.

The Netcope Development Kit is a toolset for rapid development of hardware-accelerated network applications based on Netcope FPGA Boards. It is built around a sophisticated build system and a collection of IP cores and software. It offers a comprehensive environment for prototyping applications in the shortest time possible - an invaluable feature for solution vendors, integrators and R&D teams.

DPDK, NDP and Accelerated DPDK

Basic principles of DPDK

The way memory is arranged in DPDK is shown in the following figure. The same scheme is used for each receive/transmit channel. There is a ring buffer of descriptors, a descriptor being a record that contains a pointer to a packet buffer. Packet buffers are therefore standalone and placed non-contiguously in RAM. This has several implications:

- There is a pool of free packet buffers that can be used. Every packet can be processed individually, and each packet buffer is released for re-use when appropriate.
- Every packet has to be transferred individually over the PCI Express bus, because the individual packet buffers are not contiguous in memory.
- Every buffer has to be the size of the biggest packet, or buffer chaining must be used.
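As a point of reference, the following minimal sketch (not taken from the whitepaper) shows how a typical DPDK application sets up the pool of standalone packet buffers and polls a receive queue; the port number, ring and pool sizes are illustrative assumptions.

```c
#include <stdlib.h>
#include <stdint.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>
#include <rte_debug.h>

#define RX_RING_SIZE   1024   /* illustrative descriptor ring size */
#define MBUF_POOL_SIZE 8191   /* illustrative number of packet buffers */
#define BURST_SIZE     32

int main(int argc, char **argv)
{
    /* Initialize the Environment Abstraction Layer (hugepages, device probing). */
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    /* Pool of standalone packet buffers (mbufs); the descriptors in the RX
     * ring point into this pool, as described above. Every buffer is sized
     * for the biggest expected packet. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create("mbuf_pool",
            MBUF_POOL_SIZE, 256 /* per-core cache */, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL)
        rte_exit(EXIT_FAILURE, "mempool creation failed\n");

    uint16_t port = 0;                    /* assumption: first probed port */
    struct rte_eth_conf port_conf = {0};  /* default device configuration */

    if (rte_eth_dev_configure(port, 1 /* RX queues */, 0 /* TX queues */, &port_conf) < 0 ||
        rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE,
                               rte_eth_dev_socket_id(port), NULL, pool) < 0 ||
        rte_eth_dev_start(port) < 0)
        rte_exit(EXIT_FAILURE, "port setup failed\n");

    struct rte_mbuf *bufs[BURST_SIZE];
    for (;;) {
        /* Each received packet arrives in its own mbuf, DMA-ed over
         * PCI Express as an individual transfer. */
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++)
            rte_pktmbuf_free(bufs[i]);    /* return the buffer to the pool */
    }
    return 0;
}
```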

An alternative approach: NDP

The Netcope Data Plane API (NDP) has been co-developed by Netcope Technologies and CESNET research teams with the goal of maximizing the throughput achievable over the PCI Express interface between FPGA-based network interface cards and host system memory. The motivation is that today's CPUs provide memory bandwidth of up to 85 GBps (or 680 Gbps, e.g. the Intel Xeon E7 v4 family), while the theoretical throughput of PCI Express gen3 x16 is 16 GBps (or 128 Gbps). PCI Express bandwidth therefore has to be treated as a scarce resource, as it can quickly become the bottleneck of the system.

The way memory is arranged in NDP is shown in the following figure. The depicted scheme is used for each receive/transmit channel. There is a ring buffer of descriptors and another ring buffer for packets. The ring buffer for packets is composed of individual memory pages, typically 4 MB per page. The descriptors link these individual pages into a contiguous ring buffer. This has several implications:

- There is no pool of free packet buffers, only read and write pointers for each channel. Every packet has to be processed and released before the next one is processed.
- Packets are aggregated into larger PCI Express transfers, as the memory is contiguous. This reduces PCI Express bus overhead significantly, especially for short packets.
- Memory is used efficiently, as the packets are stored one after another.
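To make the difference concrete, here is a purely conceptual sketch (the names and structures are hypothetical, not the proprietary NDP API) of what consuming such a contiguous packet ring looks like; wrap-around handling and the exact header layout are simplifying assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE_B (4u * 1024 * 1024)  /* 4 MB pages, as described in the text */
#define RING_PAGES  16                  /* illustrative ring size */

struct pkt_view {
    const uint8_t *data;  /* points directly into the ring, no copy */
    uint32_t       len;
};

struct ndp_like_channel {
    uint64_t page_phys[RING_PAGES]; /* descriptor ring: one entry per 4 MB page */
    uint8_t *ring;                  /* the pages, mapped back-to-back for software */
    uint64_t rd;                    /* software read pointer (bytes consumed) */
    uint64_t wr;                    /* hardware write pointer (bytes produced) */
};

/* Packets sit back-to-back in the ring, each preceded by a small metadata
 * header (16 B on RX in the measurements below); assume its first field is
 * the packet length. Consuming a packet means reading it in place and then
 * advancing the read pointer - there is no per-packet buffer to allocate or
 * free, and the hardware can merge many packets into one large PCIe write. */
static int rx_next(struct ndp_like_channel *ch, struct pkt_view *out)
{
    if (ch->rd == ch->wr)
        return 0;                              /* ring is empty */

    const uint8_t *hdr = ch->ring + ch->rd;
    uint32_t len = *(const uint32_t *)hdr;     /* hypothetical header layout */

    out->data = hdr + 16;                      /* payload follows the 16 B header */
    out->len  = len;
    ch->rd   += 16 + len;                      /* "free" = advance the pointer */
    return 1;
}
```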

DPDK accelerated by NDP

To provide an easy-to-use DPDK API while keeping the PCI Express bus overhead as low as in NDP, we have created a mixed solution. The lower layer of our solution (the one that actually drives PCI Express) is NDP, while the upper layer adjusts the packet data to fully adhere to the DPDK API and data model. This approach outperforms classical per-packet DPDK thanks to its PCI Express friendliness.

Results

We have performed an extensive set of measurements to evaluate the benefits and drawbacks of each approach. The tests were run on the NFB-100G2Q FPGA board and an Intel Xeon E5-2687W v4 CPU with 12 cores (24 with Hyper-Threading, which was left enabled) running at 3 GHz (3.5 GHz turbo, 30 MB of L3 cache) and 64 GB of DDR4 RAM (eight 8 GB modules) running at 2400 MHz.

In some cases the achieved throughput is greater than what the 100 Gbps Ethernet standard allows. Therefore, instead of using a 100 Gbps Ethernet interface as test input and output, we use custom FPGA firmware that generates and consumes packets internally at any speed and packet length we command. The firmware runs in the Virtex-7 FPGA at 233 MHz and uses a single PCI Express gen3 slot with 16 lanes (x16). However, due to a limitation of the Virtex-7 FPGA, the slot is logically split into two PCI Express interfaces with 8 lanes (x8) each. This is called bifurcation and is a standard feature of selected motherboards (and their BIOS). PCI Express slot bifurcation is fully hidden in the device driver implementation, so the user of the card sees only a single consistent and easy-to-use interface.

Three tests were run for each packet transfer method (DPDK, NDP, DPDK accelerated by NDP): RX, TX and RX+TX. Within each test, 1, 2, 4 and 8 cores were used to evaluate CPU performance requirements. The graphs show how system throughput varies with packet length. Throughput is reported as the measured pure PCI Express data throughput; bus overhead and transfer overhead (pointer updates, descriptor downloads, etc.) are not counted. Only the mandatory 16 B header is counted for each packet in the RX direction, and 8 B for TX, in NDP and DPDK-accelerated-by-NDP transfers. The plots are compared against the calculated PCI Express bandwidth required to transfer full 100 Gbps Ethernet at the respective packet lengths (purple lines).
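For reference, the purple lines can be reproduced from standard Ethernet framing: on the wire, every frame is accompanied by a 7 B preamble, a 1 B start-of-frame delimiter and a 12 B inter-frame gap (20 B in total), so the PCI Express data rate needed to keep up with a saturated 100 GbE link depends on the packet length. The helper below is a plausible reconstruction of that calculation, not the exact formula used for the plots; whether the 4 B FCS and the per-packet DMA headers are included is an assumption.

```c
#include <stdio.h>

#define LINE_RATE_GBPS  100.0  /* 100 Gbps Ethernet */
#define WIRE_OVERHEAD_B 20     /* preamble (7) + SFD (1) + inter-frame gap (12) */

/* Approximate PCI Express data throughput (in Gbps) needed to carry every
 * frame of a fully loaded 100 GbE link, for a given frame length in bytes.
 * extra_b models per-packet bytes added by the DMA engine (e.g. the 16 B RX
 * header mentioned above); pass 0 to count Ethernet frame bytes only. */
static double required_pcie_gbps(unsigned frame_len, unsigned extra_b)
{
    double frames_per_sec =
        LINE_RATE_GBPS * 1e9 / 8.0 / (frame_len + WIRE_OVERHEAD_B);
    return frames_per_sec * (frame_len + extra_b) * 8.0 / 1e9;
}

int main(void)
{
    /* For 64 B frames: 100 * 64 / 84 = ~76 Gbps of frame data,
     * rising to ~95 Gbps once a 16 B header per packet is added. */
    printf("64 B frames, no header:   %.1f Gbps\n", required_pcie_gbps(64, 0));
    printf("64 B frames, 16 B header: %.1f Gbps\n", required_pcie_gbps(64, 16));
    return 0;
}
```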

DPDK

Perhaps contrary to popular belief, a well-implemented DPDK has no issues achieving full 100 Gbps throughput for packet lengths above a certain threshold when the RX and TX directions are used separately. However, for short packets, and also when both directions are used at once, the disadvantage of inefficient per-packet PCI Express transfers becomes clearly visible. The saw-like shape of the lines is caused by PCI Express transaction length alignment - an inevitable effect of per-packet bus transfers.
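The saw-tooth can be reproduced with a back-of-the-envelope model (not from the whitepaper): assume, purely for illustration, a fixed per-transaction overhead and payloads padded up to a fixed alignment. The exact constants depend on the platform and are assumptions here, but any such per-packet model produces the characteristic jumps in bus efficiency as the packet length crosses an alignment boundary.

```c
#include <stdio.h>

/* Illustrative constants - the real values depend on the root complex, the
 * device and the driver; they are assumptions, not measurements. */
#define TLP_OVERHEAD_B 24  /* assumed per-transaction header/framing cost */
#define ALIGN_B        64  /* assumed payload alignment on the bus */

/* Fraction of the bus that carries packet data when every packet travels
 * in its own PCI Express transaction. */
static double per_packet_efficiency(unsigned pkt_len)
{
    unsigned padded = ((pkt_len + ALIGN_B - 1) / ALIGN_B) * ALIGN_B;
    return (double)pkt_len / (padded + TLP_OVERHEAD_B);
}

int main(void)
{
    /* Efficiency climbs within each 64 B bucket and drops sharply right
     * after a boundary is crossed - the saw-like shape seen in the graphs. */
    for (unsigned len = 64; len <= 256; len += 4)
        printf("%4u B packet -> %.0f%% bus efficiency\n",
               len, 100.0 * per_packet_efficiency(len));
    return 0;
}
```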

NDP

NDP transfers tell a completely different story. With the exception of using only one core, NDP has exceptionally stable performance for all packet lengths, providing a good margin on top of what is needed for 100 Gbps Ethernet. And no, you cannot do 100 Gbps processing on a single CPU core.

DPDK accelerated by NDP

Since it uses NDP internally, our DPDK accelerated by NDP shows no saw-like shape in its throughput graphs. For RX and TX separately, full 100 Gbps Ethernet throughput at all packet lengths can be achieved with eight CPU cores - something that is not possible at all with plain DPDK per-packet bus transfers. DPDK accelerated by NDP is therefore a very viable option for an extremely fast packet API.

Summary

The best part of this story is that all three options are available for Netcope FPGA Boards, the Netcope Development Kit, and other products built on top of them, such as Netcope Packet Capture or Netcope Session Filter. So for every application, make your choice:

- Choose DPDK if your application is intended to run at speeds below 100 Gbps and you want to save CPU time.
- Choose DPDK accelerated by NDP if you need full DPDK API compatibility as well as very high throughput, and you plan to use a high-performance CPU for your task.
- Choose NDP if you need to work very near the theoretical performance limits of your system and you do not mind using the proprietary, yet very easy to use, NDP API.