Improving DPDK Performance

The Data Plane Development Kit (DPDK) was pioneered by Intel as a way to boost the speed of packet processing on standard hardware. DPDK-enabled applications typically show four or more times better performance than their non-DPDK counterparts, thanks to the kernel bypass and full userspace processing provided by DPDK. So many commercial and open source applications support DPDK that it has become a de facto standard in the SDN and NFV world. Open vSwitch is a prominent example of a popular DPDK-enabled component of many NFV deployments. The Intel ONP Server Performance Test Report from 2015 shows that DPDK-enabled Open vSwitch can process 40 Gbps on a single core for packet lengths above a certain threshold. Even more raw throughput can be expected to be required in the near future, enabling higher-density NFV deployments and better CPU core utilization for communication-intensive tasks. This whitepaper explores the absolute performance limits of the DPDK API using Netcope FPGA Boards (NFB) and the Netcope Development Kit (NDK).

About NFB and NDK

Netcope FPGA Boards (NFB), FPGA-based programmable network interface cards, are a unique example of the symbiosis of state-of-the-art technologies fitting together in terms of achievable performance and throughput. The network link speed, the performance of the on-board network controller, the throughput of the PCI Express bus and the performance of the host system are all factors influencing the whole solution. Maximum attention was paid during the whole product design process to make every link of this chain as strong as possible.

The Netcope Development Kit is a toolset for rapid development of hardware-accelerated network applications based on Netcope FPGA Boards. It is built around a sophisticated build system and a collection of IP cores and software. It offers a comprehensive environment for prototyping applications in the shortest time possible - an invaluable feature for solution vendors, integrators and R&D teams.

DPDK, NDP and Accelerated DPDK

Basic principles of DPDK

The way memory is arranged in DPDK is shown in the following figure. The same scheme is used for each receive/transmit channel. There is a ring buffer of descriptors, a descriptor being a record that contains a pointer to a packet buffer. Packet buffers are therefore standalone and placed non-contiguously in RAM. This has several implications:

- There is a pool of free packet buffers that can be used. Every packet can be processed individually, and each packet buffer is released for re-use when appropriate.
- Every packet has to be transferred individually over the PCI Express bus, because the individual packet buffers are not contiguous in memory.
- Every buffer has to be the size of the biggest packet, or buffer chaining must be used.
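As a point of reference, the following minimal sketch (not taken from the whitepaper) shows how a typical DPDK application sets up the pool of standalone packet buffers and polls a receive queue; the port number, ring and pool sizes are illustrative assumptions.

```c
#include <stdlib.h>
#include <stdint.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>
#include <rte_debug.h>

#define RX_RING_SIZE   1024   /* illustrative descriptor ring size */
#define MBUF_POOL_SIZE 8191   /* illustrative number of packet buffers */
#define BURST_SIZE     32

int main(int argc, char **argv)
{
    /* Initialize the Environment Abstraction Layer (hugepages, device probing). */
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    /* Pool of standalone packet buffers (mbufs); the descriptors in the RX
     * ring point into this pool, as described above. Every buffer is sized
     * for the biggest expected packet. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create("mbuf_pool",
            MBUF_POOL_SIZE, 256 /* per-core cache */, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL)
        rte_exit(EXIT_FAILURE, "mempool creation failed\n");

    uint16_t port = 0;                    /* assumption: first probed port */
    struct rte_eth_conf port_conf = {0};  /* default device configuration */

    if (rte_eth_dev_configure(port, 1 /* RX queues */, 0 /* TX queues */, &port_conf) < 0 ||
        rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE,
                               rte_eth_dev_socket_id(port), NULL, pool) < 0 ||
        rte_eth_dev_start(port) < 0)
        rte_exit(EXIT_FAILURE, "port setup failed\n");

    struct rte_mbuf *bufs[BURST_SIZE];
    for (;;) {
        /* Each received packet arrives in its own mbuf, DMA-ed over
         * PCI Express as an individual transfer. */
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++)
            rte_pktmbuf_free(bufs[i]);    /* return the buffer to the pool */
    }
    return 0;
}
```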

An alternative approach: NDP

The Netcope Data Plane API (NDP) has been co-developed by Netcope Technologies and CESNET research teams with the goal of maximizing the throughput achievable over the PCI Express interface between FPGA-based network interface cards and host system memory. The motivation is that today's CPUs provide memory bandwidth of up to 85 GBps (or 680 Gbps, e.g. the Intel Xeon E7 v4 family), while the theoretical throughput of PCI Express gen3 x16 is 16 GBps (or 128 Gbps). PCI Express bandwidth therefore has to be treated as a scarce resource, as it can quickly become the bottleneck of the system.

The way memory is arranged in NDP is shown in the following figure. The depicted scheme is used for each receive/transmit channel. There is a ring buffer of descriptors and another ring buffer for packets. The ring buffer for packets is composed of individual memory pages, typically 4 MB per page. The descriptors link these individual pages into a contiguous ring buffer. This has several implications:

- There is no pool of free packet buffers, only read and write pointers for each channel. Every packet has to be processed and released before the next one is processed.
- Packets are aggregated into larger PCI Express transfers, as the memory is contiguous. This reduces PCI Express bus overhead significantly, especially for short packets.
- Memory is used efficiently, as the packets are stored one after another.
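To make the difference concrete, here is a purely conceptual sketch (the names and structures are hypothetical, not the proprietary NDP API) of what consuming such a contiguous packet ring looks like; wrap-around handling and the exact header layout are simplifying assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE_B (4u * 1024 * 1024)  /* 4 MB pages, as described in the text */
#define RING_PAGES  16                  /* illustrative ring size */

struct pkt_view {
    const uint8_t *data;  /* points directly into the ring, no copy */
    uint32_t       len;
};

struct ndp_like_channel {
    uint64_t page_phys[RING_PAGES]; /* descriptor ring: one entry per 4 MB page */
    uint8_t *ring;                  /* the pages, mapped back-to-back for software */
    uint64_t rd;                    /* software read pointer (bytes consumed) */
    uint64_t wr;                    /* hardware write pointer (bytes produced) */
};

/* Packets sit back-to-back in the ring, each preceded by a small metadata
 * header (16 B on RX in the measurements below); assume its first field is
 * the packet length. Consuming a packet means reading it in place and then
 * advancing the read pointer - there is no per-packet buffer to allocate or
 * free, and the hardware can merge many packets into one large PCIe write. */
static int rx_next(struct ndp_like_channel *ch, struct pkt_view *out)
{
    if (ch->rd == ch->wr)
        return 0;                              /* ring is empty */

    const uint8_t *hdr = ch->ring + ch->rd;
    uint32_t len = *(const uint32_t *)hdr;     /* hypothetical header layout */

    out->data = hdr + 16;                      /* payload follows the 16 B header */
    out->len  = len;
    ch->rd   += 16 + len;                      /* "free" = advance the pointer */
    return 1;
}
```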

DPDK accelerated by NDP

To provide an easy-to-use DPDK API while keeping the PCI Express bus overhead as low as in NDP, we have created a mixed solution. The lower layer of our solution (the one that actually drives PCI Express) is NDP, while the upper layer adjusts the packet data to fully adhere to the DPDK API and data model. This approach outperforms classical per-packet DPDK thanks to its PCI Express friendliness.

Results

We have performed an extensive set of measurements to evaluate the benefits and drawbacks of each approach. The tests were run on the NFB-100G2Q FPGA board and an Intel Xeon E5-2687W v4 CPU with 12 cores (24 with Hyper-Threading, which was left enabled) running at 3 GHz (3.5 GHz turbo, 30 MB of L3 cache) and 64 GB of DDR4 RAM (eight 8 GB modules) running at 2400 MHz.

In some cases the achieved throughput is greater than what the 100 Gbps Ethernet standard allows. Therefore, instead of using a 100 Gbps Ethernet interface as test input and output, we use custom FPGA firmware that generates and consumes packets internally at any speed and packet length we command. The firmware runs in the Virtex-7 FPGA at 233 MHz and uses a single PCI Express gen3 slot with 16 lanes (x16). However, due to a limitation of the Virtex-7 FPGA, the slot is logically split into two PCI Express interfaces with 8 lanes (x8) each. This is called bifurcation and is a standard feature of selected motherboards (and their BIOS). PCI Express slot bifurcation is fully hidden in the device driver implementation, so the user of the card sees only a single consistent and easy-to-use interface.

Three tests were run for each packet transfer method (DPDK, NDP, DPDK accelerated by NDP): RX, TX and RX+TX. Within each test, 1, 2, 4 and 8 cores were used to evaluate CPU performance requirements. The graphs show how system throughput varies with packet length. Throughput is reported as the measured pure PCI Express data throughput; bus overhead and transfer overhead (pointer updates, descriptor downloads, etc.) are not counted. Only the mandatory 16 B header is counted for each packet in the RX direction, and 8 B for TX, in NDP and DPDK-accelerated-by-NDP transfers. The plots are compared against the calculated PCI Express bandwidth required to transfer full 100 Gbps Ethernet at the respective packet lengths (purple lines).
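For reference, the purple lines can be reproduced from standard Ethernet framing: on the wire, every frame is accompanied by a 7 B preamble, a 1 B start-of-frame delimiter and a 12 B inter-frame gap (20 B in total), so the PCI Express data rate needed to keep up with a saturated 100 GbE link depends on the packet length. The helper below is a plausible reconstruction of that calculation, not the exact formula used for the plots; whether the 4 B FCS and the per-packet DMA headers are included is an assumption.

```c
#include <stdio.h>

#define LINE_RATE_GBPS  100.0  /* 100 Gbps Ethernet */
#define WIRE_OVERHEAD_B 20     /* preamble (7) + SFD (1) + inter-frame gap (12) */

/* Approximate PCI Express data throughput (in Gbps) needed to carry every
 * frame of a fully loaded 100 GbE link, for a given frame length in bytes.
 * extra_b models per-packet bytes added by the DMA engine (e.g. the 16 B RX
 * header mentioned above); pass 0 to count Ethernet frame bytes only. */
static double required_pcie_gbps(unsigned frame_len, unsigned extra_b)
{
    double frames_per_sec =
        LINE_RATE_GBPS * 1e9 / 8.0 / (frame_len + WIRE_OVERHEAD_B);
    return frames_per_sec * (frame_len + extra_b) * 8.0 / 1e9;
}

int main(void)
{
    /* For 64 B frames: 100 * 64 / 84 = ~76 Gbps of frame data,
     * rising to ~95 Gbps once a 16 B header per packet is added. */
    printf("64 B frames, no header:   %.1f Gbps\n", required_pcie_gbps(64, 0));
    printf("64 B frames, 16 B header: %.1f Gbps\n", required_pcie_gbps(64, 16));
    return 0;
}
```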

DPDK

Perhaps contrary to popular belief, a well-implemented DPDK has no issues achieving full 100 Gbps throughput for packet lengths above a certain threshold when the RX and TX directions are used separately. However, for short packets, and also when both directions are used at once, the disadvantage of inefficient per-packet PCI Express transfers becomes clearly visible. The saw-like shape of the lines is caused by PCI Express transaction length alignment - an inevitable effect of per-packet bus transfers.
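The saw-tooth can be reproduced with a back-of-the-envelope model (not from the whitepaper): assume, purely for illustration, a fixed per-transaction overhead and payloads padded up to a fixed alignment. The exact constants depend on the platform and are assumptions here, but any such per-packet model produces the characteristic jumps in bus efficiency as the packet length crosses an alignment boundary.

```c
#include <stdio.h>

/* Illustrative constants - the real values depend on the root complex, the
 * device and the driver; they are assumptions, not measurements. */
#define TLP_OVERHEAD_B 24  /* assumed per-transaction header/framing cost */
#define ALIGN_B        64  /* assumed payload alignment on the bus */

/* Fraction of the bus that carries packet data when every packet travels
 * in its own PCI Express transaction. */
static double per_packet_efficiency(unsigned pkt_len)
{
    unsigned padded = ((pkt_len + ALIGN_B - 1) / ALIGN_B) * ALIGN_B;
    return (double)pkt_len / (padded + TLP_OVERHEAD_B);
}

int main(void)
{
    /* Efficiency climbs within each 64 B bucket and drops sharply right
     * after a boundary is crossed - the saw-like shape seen in the graphs. */
    for (unsigned len = 64; len <= 256; len += 4)
        printf("%4u B packet -> %.0f%% bus efficiency\n",
               len, 100.0 * per_packet_efficiency(len));
    return 0;
}
```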

NDP

NDP transfers tell a completely different story. With the exception of using only one core, NDP has exceptionally stable performance for all packet lengths, providing a good margin on top of what is needed for 100 Gbps Ethernet. And no, you cannot do 100 Gbps processing on a single CPU core.

DPDK accelerated by NDP

Since it uses NDP internally, our DPDK accelerated by NDP shows no saw-like shape in its throughput graphs. For RX and TX separately, full 100 Gbps Ethernet throughput at all packet lengths can be achieved with eight CPU cores - something that is not possible at all with plain DPDK per-packet bus transfers. DPDK accelerated by NDP is therefore a very viable option for an extremely fast packet API.

Summary

The best part of this story is that all three options are available for Netcope FPGA Boards, the Netcope Development Kit, and other products built on top of them, such as Netcope Packet Capture or Netcope Session Filter. So for every application, make your choice:

- Choose DPDK if your application is intended to run at speeds below 100 Gbps and you want to save CPU time.
- Choose DPDK accelerated by NDP if you need full DPDK API compatibility as well as very high throughput, and you plan to use a high-performance CPU for your task.
- Choose NDP if you need to work very near the theoretical performance limits of your system and you do not mind using the proprietary, yet very easy to use, NDP API.