PVPP: A Programmable Vector Packet Processor. Sean Choi, Xiang Long, Muhammad Shahbaz, Skip Booth, Andy Keep, John Marshall, Changhoon Kim

Similar documents
Backend for Software Data Planes

PISCES: A Programmable, Protocol-Independent Software Switch

T4P4S: When P4 meets DPDK. Sandor Laki DPDK Summit Userspace - Dublin- 2017

PISCES:'A'Programmable,'Protocol4 Independent'So8ware'Switch' [SIGCOMM'2016]'

Programmable Packet Processing With

BESS: A Virtual Switch Tailored for NFV

Linux Network Programming with P4. Linux Plumbers 2018 Fabian Ruffy, William Tu, Mihai Budiu VMware Inc. and University of British Columbia

SmartNIC Programming Models

SmartNIC Programming Models

High-Speed Forwarding: A P4 Compiler with a Hardware Abstraction Library for Intel DPDK

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

P51: High Performance Networking

Design and Implementation of Virtual TAP for Software-Defined Networks

FastReact. In-Network Control and Caching for Industrial Control Networks using Programmable Data Planes

Gateware Defined Networking (GDN) for Ultra Low Latency Trading and Compliance

The Power of Batching in the Click Modular Router

P4 Introduction. Jaco Joubert SIGCOMM NETRONOME SYSTEMS, INC.

A Universal Dataplane. FastData.io Project

Improve Performance of Kube-proxy and GTP-U using VPP

Arrakis: The Operating System is the Control Plane

Programmable Software Switches. Lecture 11, Computer Networks (198:552)

Benchmarking of VPP. Arijit Pramanik RnD Project

vswitch Acceleration with Hardware Offloading CHEN ZHIHUI JUNE 2018

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Programmable Data Plane at Terabit Speeds

DPDK Intel NIC Performance Report Release 18.02

Accelerate Cloud Native with FD.io

OpenFlow Software Switch & Intel DPDK. performance analysis

Programmable data planes, P4, and Trellis

Comparing Open vswitch (OpenFlow) and P4 Dataplanes for Agilio SmartNICs

TOWARDS FAST IP FORWARDING

DPDK Intel NIC Performance Report Release 18.05

DPDK Performance Report Release Test Date: Nov 16 th 2016

Tutorial S TEPHEN IBANEZ

All product specifications are subject to change without notice.

DPDK Intel NIC Performance Report Release 17.08

P4FPGA Expedition. Han Wang

NetCache: Balancing Key-Value Stores with Fast In-Network Caching

Agilio CX 2x40GbE with OVS-TC

New Approach to OVS Datapath Performance. Founder of CloudNetEngine Jun Xiao

NetCache: Balancing Key-Value Stores with Fast In-Network Caching

Accelerating Contrail vrouter

Virtual Switch Acceleration with OVS-TC

Open vswitch Extensions with BPF

Link Virtualization based on Xen

Netronome 25GbE SmartNICs with Open vswitch Hardware Offload Drive Unmatched Cloud and Data Center Infrastructure Performance

High Performance Packet Processing with FlexNIC

Demystifying Network Cards

Implementing Ultra Low Latency Data Center Services with Programmable Logic

Accelerating vrouter Contrail

Research on DPDK Based High-Speed Network Traffic Analysis. Zihao Wang Network & Information Center Shanghai Jiao Tong University

Bringing the Power of ebpf to Open vswitch. Linux Plumber 2018 William Tu, Joe Stringer, Yifeng Sun, Yi-Hung Wei VMware Inc. and Cilium.

Total Cost of Ownership Analysis for a Wireless Access Gateway

It's kind of fun to do the impossible with DPDK Yoshihiro Nakajima, Hirokazu Takahashi, Kunihiro Ishiguro, Koji Yamazaki NTT Labs

Programmable Forwarding Planes at Terabit/s Speeds

Introduc)on to P4 Programming Protocol-Independent Packets Processors. Ronald van der Pol SURFnet

XDP now with REDIRECT

PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate

double split driver model

Configuring Tap Aggregation and MPLS Stripping

Supporting Fine-Grained Network Functions through Intel DPDK

Reliably Scalable Name Prefix Lookup! Haowei Yuan and Patrick Crowley! Washington University in St. Louis!! ANCS 2015! 5/8/2015!

Programming Hardware Tables John Fastabend (Intel) Netdev0.1

Dataplane Programming

Overview of the Cisco OpenFlow Agent

Networking at the Speed of Light

Experiences with Programmable Dataplanes

Data Center Virtualization: VirtualWire

PDP : A Flexible and Programmable Data Plane. Massimo Gallo et al.

QuickSpecs. HP Z 10GbE Dual Port Module. Models

Next Gen Virtual Switch. CloudNetEngine Founder & CTO Jun Xiao

P4 Language Tutorial. Copyright 2017 P4.org

Ed Warnicke, Cisco. Tomasz Zawadzki, Intel

vnetwork Future Direction Howie Xu, VMware R&D November 4, 2008

CS344 Lecture 2 P4 Language Overview. Copyright 2018 P4.org

Network Virtualization in Multi-tenant Datacenters

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

Intro to HW Design & Externs for P4àNetFPGA. CS344 Lecture 5

NFVnice: Dynamic Backpressure and Scheduling for NFV Service Chains

Building a Platform Optimized for the Network Edge

An Analysis and Empirical Study of Container Networks

MWC 2015 End to End NFV Architecture demo_

Implemen'ng IPv6 Segment Rou'ng in the Linux Kernel

Programming Netronome Agilio SmartNICs

Dynamic Compilation and Optimization of Packet Processing Programs

P4GPU: A Study of Mapping a P4 Program onto GPU Target

Programmable Data Plane at Terabit Speeds Vladimir Gurevich May 16, 2017

Cisco Ultra Packet Core High Performance AND Features. Aeneas Dodd-Noble, Principal Engineer Daniel Walton, Director of Engineering October 18, 2018

Programmable NICs. Lecture 14, Computer Networks (198:552)

G-NET: Effective GPU Sharing In NFV Systems

P4 Runtime: Putting the Control Plane in Charge of the Forwarding Plane. Presented by :

Yahoo Search ATS Plugins. Daniel Morilha and Scott Beardsley

Hardware Acceleration in Computer Networks. Jan Kořenek Conference IT4Innovations, Ostrava

ONOS-P4 Tutorial Hands-on Activity. P4 Brigade Work Days, Seoul (Korea) September 18-29, 2017

1 Gbps and 10 Gbps WAN Emulator IPNetSim

Programming Network Data Planes

Enterprise Network Compute System (ENCS)

FaRM: Fast Remote Memory

Understanding The Performance of DPDK as a Computer Architect

Transcription:

PVPP: A Programmable Vector Packet Processor Sean Choi, Xiang Long, Muhammad Shahbaz, Skip Booth, Andy Keep, John Marshall, Changhoon Kim

Fixed Set of Protocols Fixed-Function Switch Chip TCP IPv4 IPv6 Ethernet UDP HTTP BGP TLS

Custom Protocols Programmable Switching Chip TCP IPv4 IPv6 Ethernet CUSTOM_P HTTP BGP TLS

VM VM Software Switch 3 Virtual Ports 1 Physical Port

60 Approx. Number of Physical Ports vs. Virtual Ports [1] 40 20 0 2010 2011 2012 2013 2014 2015 Phyical Ports Virtual Ports [1] Martin Casado, VMWorld 2013

Custom Protocols Software Switch TCP IPv4 IPv6 Ethernet CUSTOM_P BGP HTTP PISCES [1] [1] PISCES. ACM SIGCOMM 2016. BMv2 [2] [2] https://github.com/p4lang/behavioral-model

Throughput on Eth + IPv4 + ACL benchmark application [1] PISCES v0.1 PISCES v1.0 Native OVS Throughput (Gbps) 16 14 12 10 8 6 4 2 0 7.59 [1] PISCES. ACM SIGCOMM 2016. 13.32 13.43 64 Packet Size (Bytes) Performance overhead of < 2%

So why ANOTHER P4 software switch?

Initially, the switching chip is not programmed and does not know any protocols. Parser Match+Action Tables Queues/ Scheduling Packet Metadata

Protocol Authoring L2_L3.p4 Switch OS Compile Eth VLAN Run-time API Driver Configure IPv4 IPv6 TCP Parser New Match+Action Tables Queues/ Scheduling Packet Metadata

Protocol Authoring L2_L3.p4 OF1-3.p4 Switch OS Compile Configure Run-time API Driver Parser Match+Action Tables Queues/ Scheduling Packet Metadata

Software Switch Parser Match-Action Pipeline Kernel DPDK

Domain-Specific Language (DSL) Parser Match-Action Pipeline Compile Software Switch Parser Match-Action Pipeline Kernel DPDK

PISCES P4 to OvS BMv2 P4 to a C++ custom switch DSL 1 DSL 2 Parser Parser Software Switch Software Switch 2 Parser Parser Match-Action Pipeline Match-Action Pipeline Compile Match-Action Pipeline Match-Action Pipeline Kernel Kernel DPDK DPDK

What s wrong with this design?

Not designed for CPU based architectures Limited in expressiveness Limited APIs to access low level constructs => Lot of room for improvements!

Vector Packet Processing (VPP) Platform Open source version of Cisco s Vector Packet Processing technology Modular packet processing node graph abstraction Each node processes a vector of packets to reduce CPU I-cache thrashing Extensible and dynamically reconfigurable via plugins

Vector Packet Processing (VPP) Platform Proven Performance [1] Multiple MPPS from a single x86_64 core 1 core: 9 MPPS ipv4 in+out forwarding 2 cores: 13.4 MPPS ipv4 in+out forwarding 4 cores: 20.0 MPPS ipv4 in+out forwarding > 100Gbps full-duplex on a single physical host Outperforms Open vswitch in various scenarios [1] https://wiki.fd.io/view/vpp/what_is_vpp%3f

Packet Vector dpdk-input ip4-input ip6-input llc-input ip6-lookup ip6-rewritetransmit dpdk-output

Packet Vector dpdk-input Custom-input ip4-input ip6-input llc-input Node 1 Node 2 Node i ip6-lookup Node j Custom Plugin Node k ip6-rewritetransmit Vanilla VPP Nodes dpdk-output

Packet Vector dpdk-input Enabled via CLI Custom-input ip4-input ip6-input llc-input Node 1 Node 2 Node i ip6-lookup Node j Custom Plugin Node k ip6-rewritetransmit Vanilla VPP Nodes dpdk-output

PVPP Overview Creates a plugin based on the input P4 program No changes to existing VPP code base Compiles either single node or multiple node plugin Multiple nodes are split by number of tables in the input P4 program P4 programs can be swapped dynamically

Packet Vector dpdk-input Enabled via CLI pvpp-input ip4-input ip6-input llc-input ip6-lookup Table 1 Table 2 Table i Table j Table k Multi-Node PVPP Plugin ip6-rewritetransmit Vanilla VPP Nodes dpdk-output

P4 Program VPP Plugin Cog Templates Front-end Compiler BMv2 Mid-end Compiler BMv2 Back-end Compiler JSON JSON-VPP Compiler P4 Compiler (P4C) C Files VPP Plugin Directory

Details of PVPP Plugin Headers are defined as C structs header_type ethernet_t { fields { dstaddr: 48; srcaddr: 48; ethertype: 16; } } typedef struct { u8 dstaddr[6]; u8 srcaddr[6]; u16 ethertype; } p4_type_ethernet_h; Action interface takes pointers to all header, metadata, runtime data and compiler selects the correct pointer and set of primitives to perform on the data.

Details of PVPP Plugin A table definition contains two parts 1. A match definition that defines the type of match (EXACT, LPM) and which fields to match with 2. A action definition which contains set of action pointers corresponding to the match result

PVPP CLI Two CLIs are currently supported 1. Enable/Disable PVPP Pipeline $ pvpp [ingress interface name] 2. CLI to install match rules for a particular table $ pvpp insert-rule [table name] [match value] [action name] [runtime data]

Experimental Setup MoonGen Sender/Receiver 10Gx3 PVPP DPDK 10Gx3 MoonGen Sender/Receiver M1 M2 M3 CPU: Intel Xeon E5-2640 v3 2.6GHz Memory: 32GB RDIMM, 2133 MT/s, Dual Rank NICs: Intel X710 DP/QP DA SFP+ Cards HDD: 1TB 7.2K RPM NLSAS 6Gbps

Benchmark Application Parse Ethernet/ IPv4 IPv4_match Match: ip.dstaddr Action: Set_nhop drop Destination MAC Match: ip.dstaddr Action: Set_dmac drop Source MAC Match: egress_port Action: Set_dmac drop

Baseline Performance Single Node Multiple Node Throughput (Mpps) 9 8 7 6 5 4 3 2 1 0 7.86 7.05 64 Packet Size (Bytes)

Compiler optimizations Remove redundant tables Reducing metadata access Bypassing redundant VPP nodes Reduce pointer dereference Caching logical HW interfaces Unrolling loops for multiple packet processing

Loop Unrolling Manually fetches two packets

Throughput (Mpps) Optimized Performance Single Node Multiple Node 64 byte packets, single 10G port 12 10.01 10.21 10 9.25 9.51 9.51 9.58 8.38 8.50 8.80 8.89 9.02 9.20 7.86 8 7.05 6 4 2 0 Baseline Removing Redundant Tables Reducing Metadata Access Loop Unrolling Bypassing Redundant Nodes Reducing Pointer Dereferences Caching Logical HW Interface

Optimized Performance Single Node Multiple Node Throughput (Mpps) 12 10 8 6 4 10.21 9.20 8.07 8.07 5.63 5.65 4.38 4.38 2 0 64 128 192 256 Packet Size (Bytes)

Optimized Performance Single Node Multiple Node Throughput (Mbps) 10000 9000 8000 7000 6000 5000 4000 3000 2000 1000 0 64 128 192 256 Packet Size (Bytes)

Optimized Performance Single Node Multiple Node Average CPU Cycles per Packet 300 250 200 150 100 50 133.00 159.00 149.00 172.30 171.00 222.30 194.00 255.20 0 64 128 192 256 Packet Size (Bytes)

Scalability 64 byte packets across 3 x 10G ports Single Node Multiple Node Throughput (Mpps) 60 50 40 30 20 10 8.52 8.14 17.03 16.57 26.40 24.14 35.83 33.41 44.23 40.69 53.11 49.34 0 1 2 3 4 5 6 Number of CPUs

Performance Comparison PVPP PISCES (with Microflow) PISCES (without Microflow) Throughput (Mpps) 70 60 50 40 30 20 59.53 63.49 49.31 47.23 34.71 34.72 30.22 30.22 30.20 26.78 26.78 26.78 10 0 64 128 192 256 Packet Size (Bytes)

Future Work Automated node splits based on the input program More compiler annotations for low level constructs Extending P4 support such as data plane states VPP specific P4_16 backend compiler Extending PVPP CLI features

Summary P4 PVPP - A performant and dynamically reconfigurable P4 switch based on a different packet processing abstraction - More improvements planned over the summer prior to public release VPP

Questions?