OpenFlow Software Switch & Intel DPDK performance analysis


Agenda
- Background
- Intel DPDK
- OpenFlow 1.3 implementation sketch
- Prototype design and setup
- Results
- Future work, optimization ideas

Intel DPDK basics
Why Intel DPDK? (www.intel.com/go/dpdk)
- A kernel-space implementation is more restricted: it is harder to develop and debug, interrupts are still needed, and there are performance issues.
- A user-space implementation over a normal Linux kernel is slow: the user/kernel memory separation makes copies expensive. Some workarounds exist (e.g. pcap mmap), but they are still not fast enough.
- A similar but less widespread solution is netmap: http://info.iet.unipi.it/~luigi/netmap/
Main features
- Poll-mode driver: avoids interrupts and scheduling.
- Direct I/O: the packet, or its first X bytes, is copied directly into the L1 cache.
Some details from the Intel DPDK tutorial follow.
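As a concrete illustration of the poll-mode-driver idea, here is a minimal sketch (not the prototype's code) of a DPDK forwarding loop, assuming the ports, queues and mbuf pool have already been set up with the usual rte_eth_dev_configure()/rte_eth_rx_queue_setup()/rte_eth_tx_queue_setup() calls; the API shown follows current DPDK releases.

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void poll_loop(uint16_t rx_port, uint16_t tx_port)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Busy-poll the RX queue: no interrupts, no kernel involvement. */
        uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        /* Forward the whole burst; free whatever the NIC could not queue. */
        uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```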

Intel DPDK Basic Design
- Designed to run on any Intel architecture CPU, from Intel Atom through client cores to Sandy Bridge; essential to the IA value proposition.
- Pthreads bind each software task to a hardware thread, so there is literally no scheduler overhead.
- User-level poll-mode driver: no kernel-context/interrupt-context switching overhead.
- Huge pages to improve performance: 1 GB as well as 2 MB page support, co-existing with Linux's 4 KB pages.
- Low-latency cache and memory access: DDIO, cache prefetch, and rte_cache_aligned memory.
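The "one software task pinned to one hardware thread" model looks roughly like the following sketch, assuming a recent DPDK release (older releases name the macro RTE_LCORE_FOREACH_SLAVE instead of RTE_LCORE_FOREACH_WORKER); lcore_worker() is a placeholder for the actual forwarding loop.

```c
#include <stdio.h>
#include <rte_eal.h>
#include <rte_lcore.h>

static int lcore_worker(void *arg)
{
    (void)arg;
    printf("worker running on lcore %u\n", rte_lcore_id());
    /* ...the poll-mode forwarding loop would run here forever... */
    return 0;
}

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    unsigned int lcore_id;
    /* One worker per hardware thread listed in the EAL core mask;
     * each runs to completion with no OS scheduler involvement. */
    RTE_LCORE_FOREACH_WORKER(lcore_id)
        rte_eal_remote_launch(lcore_worker, NULL, lcore_id);

    rte_eal_mp_wait_lcore();
    return 0;
}
```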

Understanding the Choices & Performance: Setting the Direction for the Intel DPDK
Scheduler (or why not)
- Hardware threads only; no scheduler or task switcher. A typical task switch costs 200+ processor cycles (varies with processor architecture).
Process a bunch of packets at a time
- Cores process packets in bunches to amortize some of the latencies.
Prefetch
- Critical to latency hiding, since there are no software threads and stalls on hardware threads are costly.
- The queue-based model is key to making prefetch effective.
Locks
- Lockless implementations wherever possible; a spinlock lock/unlock pair costs 60-90 cycles.
- Queues are lockless (single-producer and multi-producer, single-consumer); a sketch follows below.
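A minimal sketch of the lockless single-producer/single-consumer queue pattern using DPDK's rte_ring; the ring name "rx_to_worker" and the producer/consumer split are illustrative assumptions, not the prototype's design.

```c
#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

static struct rte_ring *ring;

void setup_ring(void)
{
    /* SP/SC flags select the lockless fast path (no compare-and-swap loop). */
    ring = rte_ring_create("rx_to_worker", 1024, rte_socket_id(),
                           RING_F_SP_ENQ | RING_F_SC_DEQ);
}

void producer(struct rte_mbuf *m)
{
    if (rte_ring_enqueue(ring, m) != 0)
        rte_pktmbuf_free(m);        /* ring full: drop the packet */
}

struct rte_mbuf *consumer(void)
{
    void *obj = NULL;
    if (rte_ring_dequeue(ring, &obj) != 0)
        return NULL;                /* ring empty */
    return obj;
}
```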

Scheduler or Why Not?
- The primary reason was performance: task-switch overhead is typically a few hundred cycles.
- FXSAVE/FXRSTOR cost roughly 100 and 150 cycles respectively (on Intel NetBurst); they are faster on recent processors, but not significantly.
- Add the cost of an interrupt if the switch is pre-emptive.
- To put that in perspective in a 10 GbE environment: on a 3 GHz processor, a small (64 B) packet arrives every 67.2 ns = 201 cycles (see the worked calculation below).
- For lower-bandwidth environments, the essential thing to consider is the added CPU bandwidth consumed.
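The 67.2 ns / 201-cycle budget follows from the wire format: a 64 B frame occupies 84 B on the wire once the 20 B of preamble, start-of-frame delimiter and inter-frame gap are added.

```latex
t_{\text{pkt}} = \frac{(64 + 20)\,\text{B} \times 8\ \tfrac{\text{bit}}{\text{B}}}{10\ \tfrac{\text{Gbit}}{\text{s}}} = 67.2\ \text{ns},
\qquad
67.2\ \text{ns} \times 3\ \tfrac{\text{cycles}}{\text{ns}} \approx 201\ \text{cycles}
```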

Packet Bunching
Done on the NIC today:
- NIC receive descriptors are bunched four to a cache line; writing back partial descriptors carries a severe performance penalty (CPU and I/O device conflict on the same cache line, and memory and PCIe bandwidth usage increases).
- Bunching is needed to overcome PCI Express latencies.
- All Intel Ethernet* controllers have settings that control descriptor write-back: coalesce as many descriptors as possible on receive; transmit-side coalescing is done as well (software controlled); timer values can be set to control latency (EITR).
Intel DPDK takes the paradigm to the next level by having the fast path process bunches of packets, facilitated by the queue abstraction (see the sketch below).
* Other names and brands may be claimed as the property of others.
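For the software-controlled transmit-side coalescing mentioned above, DPDK's generic ethdev API exposes thresholds at queue-setup time; the values below are illustrative, not the prototype's configuration.

```c
#include <rte_ethdev.h>

int setup_tx_queue(uint16_t port, uint16_t queue, unsigned int socket)
{
    struct rte_eth_txconf txconf = {
        /* Ask the NIC to write back descriptor status only every 32
         * descriptors instead of per packet, and reclaim completed
         * descriptors in batches of 32 as well. */
        .tx_rs_thresh   = 32,
        .tx_free_thresh = 32,
    };
    return rte_eth_tx_queue_setup(port, queue, 512, socket, &txconf);
}
```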

Prefetch
Two types of prefetch: hardware and software.
- Hardware prefetch is issued by the core:
  - L1 DCU prefetcher: streaming prefetcher triggered by ascending accesses to recently loaded data.
  - L1 IP-based strided prefetcher: triggered on individual loads with a stride.
  - L2 DPL: prefetches data into the L2 cache based on DCU requests; adjacent-cache-line and strided prefetch (e.g. skipped cache lines).
- Software prefetch must be issued appropriately ahead of time to be effective; too early and the line may be evicted before use. There are multiple types of software prefetch (see the sketch below).
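A common way to apply software prefetch in a burst-processing loop is to prefetch the header of packet i+N while packet i is being handled; the offset of 3 and process_packet() are illustrative placeholders.

```c
#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define PREFETCH_OFFSET 3

void process_packet(struct rte_mbuf *m);   /* placeholder, defined elsewhere */

static void handle_burst(struct rte_mbuf **pkts, uint16_t n)
{
    uint16_t i;

    /* Prime the pipeline with the first few packet headers. */
    for (i = 0; i < PREFETCH_OFFSET && i < n; i++)
        rte_prefetch0(rte_pktmbuf_mtod(pkts[i], void *));

    for (i = 0; i < n; i++) {
        /* Issue the prefetch early enough that the data is in L1 before it
         * is read, but not so early that it gets evicted before use. */
        if (i + PREFETCH_OFFSET < n)
            rte_prefetch0(rte_pktmbuf_mtod(pkts[i + PREFETCH_OFFSET], void *));
        process_packet(pkts[i]);
    }
}
```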

Paging
- 1 GB super-page and 2 MB huge-page support.
- The performance implications are primarily due to D-TLB thrashing/replacement.
- The paging performance drop is difficult to gauge: it depends heavily on the application and gets significantly worse as the memory footprint increases.
- It varies by architecture, but initial measurements suggested ~30% on L3 forwarding, quite often with 2-3 D-TLB replacements per packet.
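A sketch of how a large lookup table can be backed by huge pages through DPDK's memzone API, so that walking it touches far fewer D-TLB entries than 4 KB pages would; the zone name and the fallback strategy are assumptions for illustration.

```c
#include <rte_memzone.h>
#include <rte_lcore.h>

const struct rte_memzone *alloc_flow_table(size_t len)
{
    const struct rte_memzone *mz;

    /* Prefer 1 GB pages; this fails if none were reserved at boot. */
    mz = rte_memzone_reserve("flow_table", len, rte_socket_id(),
                             RTE_MEMZONE_1GB);
    if (mz == NULL)
        /* Fall back to 2 MB huge pages. */
        mz = rte_memzone_reserve("flow_table", len, rte_socket_id(),
                                 RTE_MEMZONE_2MB);
    return mz;   /* mz->addr points at hugepage-backed memory */
}
```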

Intel Data Direct I/O Technology (Intel DDIO)
[Measurement platform: 1x SNB-EP 8C B0, 2.0 GHz; the accompanying figures are not reproduced here.]

Intel DPDK Performance: IPv4 Layer 3 Forwarding on an IA Server Platform
[Chart: 64 B throughput in Mpps across platform generations, 2009-2013]
- Native Linux kernel performance: ~12 Mpps.
- Introduction of the integrated memory controller + Intel DPDK: 42 Mpps with DPDK R0.7, 55 Mpps with R1.0.
- Introduction of the integrated PCIe* controller: 110 Mpps (1S) and 250 Mpps (2S) with DPDK R1.3; 255 Mpps with R1.4.
- Per-core figures: DPDK Release 1.3, SNB @2.1 GHz: 18.6 Mpps (1C/1T), 24 Mpps (1C/2T); SNB @2.7 GHz: 23.7 Mpps (1C/1T), 28.8 Mpps (1C/2T); DPDK Release 1.4, IVB @2.4 GHz: 23.9 Mpps (1C/1T), 28.5 Mpps (1C/2T).
- Test platforms (left to right): 2009 2S Intel Xeon E5645 (2x6C Westmere) 2.4 GHz; 2009 2S Intel Xeon E5540 (2x4C Nehalem) 2.53 GHz; 2010 2S Intel Xeon E5645 (2x6C Westmere-EP) 2.40 GHz; 2012 1S Intel Xeon E5-2658 C1 stepping (1x8C Sandy Bridge-EP) 2.1 GHz, 8x10GbE PCIe Gen2; 2012 2S Intel Xeon E5-2658 C1 stepping (2x8C Sandy Bridge-EP) 2.1 GHz, 22x10GbE PCIe Gen2; 2013 2S Intel Xeon E5-2658 v2 (2x10C Ivy Bridge-EP) 2.4 GHz, 22x10GbE PCIe Gen2.
Massive IA performance improvements since 2009; PCIe Gen3 will offer even better performance.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

OpenFlow basics
Main idea: programmable networking, i.e. flexibility and programmability together with high performance.
Problem: today OF is either flexible or fast.
- Flexible rules with many tuples require either TCAM or a slow lookup; TCAM is expensive and consumes a lot of power.
- Complex instructions and actions mean high overhead for software implementations.
- Some solutions limit flexibility to increase performance (e.g. TTPs).
In theory, performance should depend only on the data-plane functions the node implements in the given scenario: it should be irrelevant whether the device executes a native implementation of the use case or OF rules programmed by a controller for the same purpose.

[Diagram: the OpenFlow 1.3 data-plane model. A Flow Table (selected by table ID) is populated via add/remove/modify flow-entry operations from the control plane; a wildcard/priority lookup selects a Flow Entry, whose Instructions either Apply-Actions immediately or write to the Action Set that is executed at the end of the pipeline. Actions reference Group Entries (by group ID, with buckets and bucket liveliness), Meters (by meter ID) and Ports/Queues (by port ID and queue ID, with port liveliness). The diagram distinguishes per-packet data access, control-plane data access, internal-control data access and liveliness propagation, and shows cleanup dependencies such as removing dependent flows when a referenced group or meter entry is removed.]

Agenda
- Background
- OpenFlow 1.3 implementation sketch
- Intel DPDK
- Prototype design and setup
- Results
- Future work, optimization ideas

Why a new prototype?
Software prototypes investigated and not selected:
- OVS: well-established, open source, but aimed mainly at virtual environments and has performance issues; OVS on Intel DPDK (OVDK) is an ongoing activity.
- CPqD softswitch: used by ONF for prototyping new features, open source, but has serious performance limitations.
- LINC: Erlang-based softswitch, open source, but runs in a VM environment, while we are primarily interested in close-to-the-hardware solutions.
Hardware-based prototypes and products:
- Serious limitations in the number of rules: OF implementations usually rely on TCAM, which has limited capacity.
- Usually hard to program, modify, or extend with new features.

Configuration
- Simple MAC-based forwarding: 1, 10, 100, 1000, 2000 and 5000 DMAC rules, currently with linear search; the last rule always matches, so caching is not easy.
- Instruction = Write-Actions; action set = Output(egress port). (A sketch of this lookup follows below.)
- An Intel DPDK based generator station (tgen) generates 15 Mpps (at 64 bytes/packet) on one core.
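In pseudocode terms, the measured data path boils down to a linear search over DMAC rules whose action set holds a single Output action; the structures and names below are a hypothetical sketch, not the prototype's actual code.

```c
#include <stdint.h>
#include <string.h>

struct flow_rule {
    uint8_t  dmac[6];      /* match field: destination MAC */
    uint16_t out_port;     /* action set: Output(egress port) */
};

struct flow_table {
    struct flow_rule *rules;
    unsigned int      n_rules;
};

/* Returns the egress port, or -1 on a table miss. In the measurements the
 * matching rule is always the last one, i.e. the worst case for this search. */
int lookup_dmac(const struct flow_table *t, const uint8_t dmac[6])
{
    for (unsigned int i = 0; i < t->n_rules; i++)
        if (memcmp(t->rules[i].dmac, dmac, 6) == 0)
            return t->rules[i].out_port;
    return -1;
}
```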

[Diagram: the OpenFlow 1.3 model from the previous slide, annotated with the prototype's per-packet processing steps numbered 1-7 along the path from the Flow Table lookup and Flow Entry match, through the Instructions/Apply-Actions and Action Set execution, to the output Port and Queue.]

Measurement setup
- Management network: 3Com switch, 172.31.32.0/24, 1 GbE.
- GENERATOR: Intel Xeon E5-2630 (2x6 cores @ 2.3 GHz), 8x4 GB DDR3 SDRAM, Intel Niantic (82599EB) 2x10 GbE NIC.
- OF-SW: Intel Xeon E5-2630 (2x6 cores @ 2.3 GHz), 8x4 GB DDR3 SDRAM.
- The generator and the OF switch are connected by two 10 GbE links; both machines also have 1 GbE links to the management network.

[Diagram: internal layout of the OF-SW host. core0-core2 run Linux and use the standard Linux driver for the 1 GbE management ports ETH0 and ETH1; core3-core5 run the rx/tx loops and the OF code over the Intel DPDK driver, with per-queue rx/tx pairs (q42/q52, q43/q53) on the 10 GbE ports ETH2 and ETH3, which connect to the GENERATOR.]

Results
Main results:
- ~25% overhead vs. L2FWD (Intel's example application); it was more before the software was heavily optimized.

  Pkt size (B) | L2FWD (Mpps) | L2FWD (Gbps) | OF 1.3, 1 rule (Mpps) | OF 1.3, 1 rule (Gbps)
  64           | 13.82        | 7.08         | 10.27                 | 5.26
  128          |  8.45        | 8.65         |  8.45                 | 8.65
  256          |  4.53        | 9.28         |  4.53                 | 9.28
  512          |  2.35        | 9.63         |  2.35                 | 9.63

- Processing cost grows linearly with the number of rules (not surprisingly). [Chart: performance (kpps) and processing time (ns) vs. number of rules, 1-2000, log scale.]
So we began some investigation.

Some details
Processing time as a function of the number of rules: at small rule counts the caches are used effectively; note that real traffic would behave better. [Chart: time per rule (ns) vs. number of rules, 10-5000.]

Improving further
Preliminary results with the current code: the overhead has been completely removed. [Chart: 64 B throughput (Mpps) for L2FWD, OF 1.3 with 1 rule, and the improved OF 1.3++.]

And further
Current status: the static OF overhead has basically been removed. It is now time to improve rule-processing speed and to implement the control plane.
Basic ideas under discussion:
- High-performance southbound interface: minimize the need for locking, timeouts, etc.
- Fast data-plane execution: flow caching (see the sketch below), lookup algorithm selection, selective TTP usage prediction.
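One way the flow-caching idea could look, as a sketch only and not the prototype's design (it reuses the flow_table and lookup_dmac definitions from the configuration sketch above; the cache size and hash are illustrative assumptions): a small direct-mapped exact-match cache keyed by DMAC sits in front of the linear rule search, so packets of already-seen flows skip the linear scan.

```c
#include <stdint.h>
#include <string.h>

struct flow_table;                                       /* from the earlier sketch */
int lookup_dmac(const struct flow_table *t, const uint8_t dmac[6]);

#define CACHE_SIZE 1024   /* power of two */

struct cache_entry {
    uint8_t dmac[6];
    uint8_t valid;
    int     out_port;
};

static struct cache_entry flow_cache[CACHE_SIZE];

static uint32_t dmac_hash(const uint8_t dmac[6])
{
    uint32_t h = 2166136261u;                            /* FNV-1a */
    for (int i = 0; i < 6; i++)
        h = (h ^ dmac[i]) * 16777619u;
    return h & (CACHE_SIZE - 1);
}

int cached_lookup(const struct flow_table *t, const uint8_t dmac[6])
{
    struct cache_entry *e = &flow_cache[dmac_hash(dmac)];

    if (e->valid && memcmp(e->dmac, dmac, 6) == 0)
        return e->out_port;                              /* fast path: cache hit */

    int port = lookup_dmac(t, dmac);                     /* slow path: linear search */
    if (port >= 0) {
        memcpy(e->dmac, dmac, 6);
        e->out_port = port;
        e->valid = 1;
    }
    return port;
}
```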