Programmable Data Plane at Terabit Speeds. Vladimir Gurevich, May 16, 2017


Conventional wisdom in networking: programmable switches are 10-100x slower than fixed-function switches, they cost more, and they consume more power.

Moore's Law and Simple Math

Serial I/O: about 30% of switch chip area. Examples: Intel Alta (2011), Cisco (2011), Broadcom Tomahawk (2014), Ericsson Spider (2011), Barefoot Tofino (2016).

Typical switch chip area breakdown: about 30% Serial I/O, 50% Lookup Tables and Packet Buffer, 20% Logic (Packet Processing).

30% Serial I/O, 50% Lookup Tables and Packet Buffer, 20% Logic (Packet Processing). Only the Logic dictates whether the chip is fixed-function or programmable.

Parallelism and alternatives: Sequential semantics does not prohibit parallelism. Doing everything does not mean doing everything all the time.

PISA: Important Details. A Programmable Parser, a Programmable Match-Action Pipeline built from Match+Action Stages (Units), and a Programmable Deparser.

PISA: Important Details. Each Match+Action Stage (Unit) can support multiple simultaneous lookups and actions: N lookups, M actions.

PISA: Match and Action are Separate Phases. Sequential Execution (Match Dependency): Total Latency = 3.

PISA: Match and Action are Separate Phases. Staggered Execution (Action Dependency): Total Latency = 2.

PISA: Match and Action are Separate Phases. Parallel Execution (No Dependencies): Total Latency = 1.1.

PISA: Match is not required. A sequence of actions is an imperative program.
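To make this concrete, here is a minimal, hypothetical P4_16 sketch (not from the slides): a control block whose apply body is a pure imperative sequence of statements and action calls, with no table lookups at all. The headers_t and metadata_t types and the to_cpu field are assumptions used only for illustration.

    control PureActions(inout headers_t hdr, inout metadata_t meta) {
        action decrement_ttl() {
            hdr.ipv4.ttl = hdr.ipv4.ttl - 1;
        }
        apply {
            /* No match phase: just an imperative sequence of operations */
            decrement_ttl();
            if (hdr.ipv4.ttl == 0) {
                meta.to_cpu = 1;   /* assumed bit<1> metadata flag */
            }
        }
    }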

Symmetric Switch Model: why not share the hardware between ingress and egress? (Packet Queuing, Replication & Scheduling sits in between.)

Introducing Tofino

6.5 Tb/s Tofino Summary. State-of-the-art design: TSMC 16nm FinFET+, MAC + Serial I/O, a Single Shared Packet Buffer, Four Match+Action Pipelines. A fully programmable PISA embodiment: all compiled programs run at line rate; up to 1.3 million IPv4 routes. Port configurations: 65 x 100GE/40GE, 130 x 50GE, 260 x 25GE/10GE. CPU interfaces: PCIe Gen3 x4/x2/x1 and a dedicated 100GE port. (Diagram: Match+Action Pipelines 0 through 3 arranged around the Shared Packet Buffer & TM.)

Tofino Simplified Block Diagram: reset/clocks, PCIe, CPU MAC, DMA engines, and control & configuration blocks, plus four pipes (pipe 0 through pipe 3); each pipe has Rx MACs (10/25/40/50/100) feeding an Ingress Pipeline and an Egress Pipeline feeding Tx MACs (10/25/40/50/100), all connected through the Traffic Manager. Each pipe has 16 x 100G MACs, plus additional ports for recirculation, the Packet Generator, and the CPU.

P4 16 Language Elements:
Parsers: state machine, bitfield extraction
Controls: tables, actions, control flow statements
Expressions: basic operations and operators
Data Types: bitstrings, headers, structures, arrays
Architecture Description: programmable blocks and their interfaces
Extern Libraries: support for specialized components
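As a rough, hypothetical illustration of how these elements fit together (names such as ethernet_t, MyIngress, and the metadata fields are assumptions, not taken from the slides; #include <core.p4> is assumed for NoAction and packet_in):

    header ethernet_t {                        /* data type: a header made of bitstrings */
        bit<48> dstAddr;
        bit<48> srcAddr;
        bit<16> etherType;
    }

    control MyIngress(inout headers_t hdr, inout metadata_t meta) {
        action drop_packet() { meta.drop = 1; }    /* action */
        table smac_filter {                        /* table */
            key     = { hdr.ethernet.srcAddr : exact; }
            actions = { drop_packet; NoAction; }
            default_action = NoAction();
        }
        apply {                                    /* control flow */
            if (hdr.ethernet.isValid()) {          /* expression */
                smac_filter.apply();
            }
        }
    }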

Packet Header Vector (PHV): a set of uniform containers that carry the headers and metadata along the pipeline. Fields can be packed into any container or combination of containers; the PHV Allocation step in the compiler decides the actual packing. Containers come as 8-bit, 16-bit, and 32-bit words.
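For instance, given the 8/16/32-bit container sizes above, a header such as this MPLS definition (the field layout is the standard MPLS one, but the packing comments are only an assumption about what a compiler might choose) could be split across containers in several ways:

    header mpls_t {
        bit<20> label;   /* label + exp + bos (24 bits) might share one 32-bit container */
        bit<3>  exp;
        bit<1>  bos;
        bit<8>  ttl;     /* ttl could live in its own 8-bit container */
    }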

Unified Pipeline: there is no difference between ingress and egress processing, so the same blocks can be efficiently shared. (Block diagram: per-pipe Rx/Tx MACs at 10/25/40/50/100, Ingress and Egress Pipelines, connected through the Traffic Manager, with PCIe, CPU MAC, DMA engines, and control & configuration alongside.)

The Basic Structure: 1/10/40/100G Rx MACs feed the Ingress Parser; the extracted header travels through the Ingress Match-Action Pipeline to the Deparser, while the full packet goes to the Common Queuing and Packet Data Buffers.

Parser: a TCAM matches on the current state and data from the Input Shift Register; the corresponding SRAM entry supplies the action, that is, the next state, the shift control, and the field selection that drives the field extract registers, which write extracted fields into the 8b/16b/32b Packet Header Vector containers. Packet data arrives from the MAC.
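One way to connect this hardware model to the language (this mapping is an interpretation, and the header names below are assumptions, not from the slides): each select() branch of a P4_16 parser corresponds roughly to one TCAM entry of {current state, lookup data} that yields a next state and a set of field extractions.

    parser EthParser(packet_in pkt, out headers_t hdr) {
        state start {
            pkt.extract(hdr.ethernet);                 /* field extraction into PHV containers */
            transition select(hdr.ethernet.etherType) {
                0x0800  : parse_ipv4;                  /* roughly one TCAM entry: {start, 0x0800} */
                0x8847  : parse_mpls;                  /* roughly one TCAM entry: {start, 0x8847} */
                default : accept;
            }
        }
        state parse_ipv4 {
            pkt.extract(hdr.ipv4);
            transition accept;
        }
        state parse_mpls {
            pkt.extract(hdr.mpls.next);
            transition accept;
        }
    }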

Parser Visualization

Pipeline Organization (block diagram): MACs feed the Ingress Buffer and Ingress Parsers 0..M; PHVs pass through the combined Ingress/Egress Match-Action Pipeline (MAU 0 through MAU n) to the Ingress Deparser and Ingress Packet Constructor and into the Queues and Packet Buffers; on egress, Egress Parsers 0..M, the same Match-Action Pipeline, the Egress Deparser, and the Egress Packet Constructor rebuild the packet, with a Recirculation Buffer carrying packet references and metadata. Packet bodies travel alongside the pipeline; only the headers go through it.

What Happens Inside? In each Match+Action Stage (Unit), the PHV (Packet Header Vector) feeds a crossbar that forms the match keys; the keys are looked up in Match Tables (SRAM or TCAM), and the resulting actions modify the PHV before it is passed to the next stage.

Parallelism in P4. Most P4 programs have inherent parallelism; others can be executed speculatively. For example, switch.p4 has ~100 tables and if() statements in ~22 stages divided between ingress and egress, for a degree of parallelism of ~4.5.

Parallel lookups, with resolution in the next stage:

    apply {
        /* Parallel lookups possible */
        subnet_vlan.apply();
        mac_vlan.apply();
        protocol_vlan.apply();
        port_vlan.apply();
        /* Resolution in next stage */
        resolve_vlan.apply();
    }

The same logic written as nested conditions; here the later lookups can be executed speculatively:

    apply {
        if (!subnet_vlan.apply().hit) {
            if (!mac_vlan.apply().hit) {
                if (!protocol_vlan.apply().hit) {
                    port_vlan.apply();
                }
            }
        }
    }
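For completeness, a hypothetical sketch (not shown in the slides) of what resolve_vlan could look like: a table in the next stage that matches on hit flags set by the previous lookups and selects the winning VLAN. All field and action names here are assumptions.

    action use_subnet_vlan()   { meta.vlan = meta.subnet_vlan; }
    action use_mac_vlan()      { meta.vlan = meta.mac_vlan; }
    action use_protocol_vlan() { meta.vlan = meta.protocol_vlan; }
    action use_port_vlan()     { meta.vlan = meta.port_vlan; }

    table resolve_vlan {
        key = {
            meta.subnet_vlan_hit   : exact;   /* flags assumed to be written by the tables above */
            meta.mac_vlan_hit      : exact;
            meta.protocol_vlan_hit : exact;
        }
        actions        = { use_subnet_vlan; use_mac_vlan; use_protocol_vlan; use_port_vlan; }
        default_action = use_port_vlan();
    }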

How Tofino Supports Parallel Processing: multiple tables mean multiple parallel lookups, and all actions from all active tables are combined (N lookups, M actions per Match+Action Stage). Inside the Match Action Unit, the Packet Header Vector feeds an Exact Match Xbar and a Ternary Match Xbar that build the keys (Xkey0..XkeyN, Tkey0..TkeyN) for Exact Tables 0..N and Ternary Tables 0..N; their actions are merged by the Action Engine, which writes the resulting PHV.

Parallelism in P4. Most actions can be easily parallelized; this action can be executed in 1 cycle (number of parallel operations: 12). Keep fields in separate containers (8-bit, 16-bit, and 32-bit words).

    action ipv4_in_mpls(in bit<20> label1, in bit<20> label2) {
        hdr.mpls[0].setValid();
        hdr.mpls[0].label = label1;
        hdr.mpls[0].exp = 0;
        hdr.mpls[0].bos = 0;
        hdr.mpls[0].ttl = 64;
        hdr.mpls[1].setValid();
        hdr.mpls[1] = { label2, 0, 1, 128 };
        if (hdr.vlan_tag.isValid()) {
            hdr.vlan_tag.ethertype = 0x8847;
        } else {
            hdr.ethernet.ethertype = 0x8847;
        }
    }

P4 Visualizations (PHV Allocation)

How Tofino Supports Parallel Processing: multiple tables mean multiple parallel lookups, all actions from all active tables are combined (N lookups, M actions), and each PHV container has its own, independent processor driven by its own {opcode, operands} instruction.

Dependency Analysis: Independent Tables, Match Dependency, Action Dependency.

Independent Tables. The tables are independent: both matching and action execution can be done in parallel.

    action ing_drop() {
        modify_field(ing_metadata.drop, 1);
    }
    action set_egress_port(egress_port) {
        modify_field(ing_metadata.egress_spec, egress_port);
    }
    table dmac {
        reads   { ethernet.dstaddr : exact; }
        actions { nop; set_egress_port; }
    }
    table smac_filter {
        reads   { ethernet.srcaddr : exact; }
        actions { nop; ing_drop; }
    }
    control ingress {
        apply(dmac);
        apply(smac_filter);
    }

(Diagram: after the Parser, dmac and smac_filter can occupy the same stage.)

Match and Action are Separate Phases. Parallel Execution (No Dependencies): Total Latency = 1.1.

Action Dependency. The tables act on the same field and therefore must be placed in separate stages.

    action ing_drop() {
        modify_field(ing_metadata.drop, 1);
    }
    action set_egress_port(egress_port) {
        modify_field(ing_metadata.egress_spec, egress_port);
    }
    table dmac {
        reads   { ethernet.dstaddr : exact; }
        actions { nop; ing_drop; set_egress_port; }
    }
    table smac_filter {
        reads   { ethernet.srcaddr : exact; }
        actions { nop; ing_drop; }
    }
    control ingress {
        apply(dmac);
        apply(smac_filter);
    }

(Diagram: after the Parser, dmac runs first and smac_filter is placed in the following stage.)

Match and Action are Separate Phases. Staggered Execution (Action Dependency): Total Latency = 2.

Match Dependency. The second table matches on a field modified by the first.

    action set_bd(bd) {
        modify_field(ing_metadata.bd, bd);
    }
    table port_bd {
        reads   { ing_metadata.ingress_port : exact; }
        actions { set_bd; }
    }
    table dmac {
        reads   { ethernet.dstaddr : exact; ing_metadata.bd : exact; }
        actions { nop; set_egress_port; }
    }
    table smac_filter {
        reads   { ethernet.srcaddr : exact; }
        actions { nop; ing_drop; }
    }
    control ingress {
        apply(port_bd);
        apply(dmac);
        apply(smac_filter);
    }

(Diagram: after the Parser, port_bd and smac_filter can start immediately, while dmac must wait for port_bd's result and is placed in a later stage.)

Reducing Latency: Match and Action are Separate Phases. Sequential Execution (Match Dependency): Total Latency = 3.

P4 Visualizations (Resource Allocation)

P4 Visualizations (Resource Usage Summary)

Conclusions: A programmable data plane at the highest performance point is a reality now. P4 and real-world programs provide plenty of optimization opportunities.

Thank you