Transcription:

ClickNP

[Figure: network topology with nodes C0-C1 (10.3.0.1, 10.3.1.1), L0-L3 (10.2.0.1 to 10.2.3.1), and T0-T3, annotated with Ethernet port ranges] 2

Network function            Implementation          1500B pkt @ 40 Gbps  64B pkt @ 40 Gbps
                                                    (normal case)        (worst-case estimate)
NVGRE tunnel encapsulation  Hyper-V virtual switch  5                    100
Firewall (8K rules)         Linux iptables          21                   480
3
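
The two columns of the table above differ only in packet size, so the gap between them can be related to packet rate. The sketch below is an illustration rather than part of the slides: it computes packets per second on a 40 Gbps link for 1500-byte and 64-byte packets. The function name and the optional 20-byte per-packet Ethernet overhead (preamble plus inter-frame gap) are assumptions added here.

```python
def packets_per_second(link_gbps: float, pkt_bytes: int, overhead_bytes: int = 0) -> float:
    """Packets per second that fit on a link of the given rate."""
    bits_per_packet = (pkt_bytes + overhead_bytes) * 8
    return link_gbps * 1e9 / bits_per_packet

# Packet rates for the two columns of the table, ignoring framing overhead.
normal = packets_per_second(40, 1500)
worst = packets_per_second(40, 64)

# Same worst case with the assumed 20 bytes of preamble and inter-frame gap.
worst_wire = packets_per_second(40, 64, overhead_bytes=20)

print(f"1500B: {normal / 1e6:.2f} Mpps")
print(f"64B:   {worst / 1e6:.2f} Mpps (ratio {worst / normal:.1f}x)")
print(f"64B on the wire: {worst_wire / 1e6:.2f} Mpps")
```

The 64-byte packet rate is about 23 times the 1500-byte rate. That is close to the ratio between the table's two columns (100 / 5 = 20 and 480 / 21, roughly 23), which suggests the listed cost grows roughly with packet rate.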

88'h68656c6c6f20776f726c64 Ahhhhhhhhhhhh! 7
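
The hexadecimal constant on the slide above can be decoded to see what it encodes. This is a small illustrative check, not something from the slides: it converts the hex digits to ASCII and counts the bits they occupy.

```python
hex_digits = "68656c6c6f20776f726c64"  # the digits shown on the slide

# Each pair of hex digits is one byte; interpret the bytes as ASCII text.
text = bytes.fromhex(hex_digits).decode("ascii")

# Each hex digit carries 4 bits, which gives the width of the constant.
bit_width = len(hex_digits) * 4

print(bit_width, repr(text))
```

The digits spell an 11-character ASCII string in 88 bits, which matches the leading 88 on the slide.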

ClickNP
Fully programmable using a high-level language
Click abstractions familiar to software developers; easy code reuse
High throughput; microsecond-scale latency
FPGA is no panacea; fine-grained processing separation
8

A B C 9

[Figure: parts of an element, labeled (reg/mem), (I/O), (main thread), (I/O), (ISR), (interrupt)] 10

Element A (FPGA) Element B (CPU) PCIe I/O channel 11

Verilog code (.v) 12

[Figure: ClickNP architecture. Components shown: ClickNP script; cross-platform toolchain (ClickNP compiler, C compiler: Visual Studio / GCC, vendor HLS: Altera OpenCL / Vivado HLS); on the host, the ClickNP host process (manager thread, worker thread), ClickNP library, ClickNP host manager, vendor libraries, and Catapult PCIe driver; on the FPGA, ClickNP elements, the vendor-specific ClickNP runtime, and the Catapult shell; a PCIe I/O channel between host and FPGA.] 13

CPU logger ClickNP Configuration: 19

Count element: CPU logger ClickNP Configuration: 20

[Figure: the statements s += pkt[0], s += pkt[1], s += pkt[2] drawn as a chain of adders between Input (pkt) and Output (s), repeated for successive inputs] 23
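
The chain of additions on the slide above can be read as a three-stage pipeline in which each stage handles a different packet in the same cycle. The following is a minimal cycle-level sketch under that reading. The function `run_pipeline` and the tuple representation of packets are hypothetical and do not come from ClickNP.

```python
def run_pipeline(packets):
    """Simulate three adder stages; stage i performs s += pkt[i]."""
    slots = [None, None, None]   # (pkt, partial_sum) currently in each stage
    pending = list(packets)
    outputs, cycles = [], 0
    while pending or any(slot is not None for slot in slots):
        # Every in-flight item advances one stage; a new packet enters stage 0.
        new_item = (pending.pop(0), 0) if pending else None
        slots = [new_item] + slots[:2]
        # All occupied stages do their addition in the same cycle.
        slots = [None if slot is None else (slot[0], slot[1] + slot[0][i])
                 for i, slot in enumerate(slots)]
        cycles += 1
        # The item in the last stage is finished at the end of the cycle.
        if slots[2] is not None:
            outputs.append(slots[2][1])
            slots[2] = None
    return outputs, cycles

packets = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]
sums, cycles = run_pipeline(packets)
print(sums, cycles)
```

With four packets the pipeline finishes in 4 + 3 - 1 = 6 cycles, whereas running the three additions for each packet back to back would take 4 * 3 = 12 cycles.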

[Figure: pipeline stages Read input, Read mem, Increment, Write mem, Write out, with the Read / Inc / Write operations of successive inputs overlapping in time] 24

Original stages: Read input, Read mem, Increment, Write mem, Write out
Memory read and write can operate in parallel: read in.addr, write buf.addr (different memory addresses!)
Delayed write: buffer new data in a register; delay the memory write until the next read
Stages with delayed write: Read input, Read buf / Read mem (selected by in.addr = buf.addr?), Increment, Write buf, Write mem, Write out
25
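
The delayed-write idea on the slide above can be modeled in software. The sketch below is one interpretation, not the authors' implementation: the newest counter value is held in a register (`buf_addr`, `buf_val`), a later access to the same address is served from that register, and the buffered value is written to memory only when a different address is read. The class and method names are hypothetical.

```python
class DelayedWriteCounter:
    """Per-address counters with a one-entry write buffer."""

    def __init__(self, size):
        self.mem = [0] * size
        self.buf_addr = None   # address of the buffered (not yet written) value
        self.buf_val = 0

    def increment(self, addr):
        if addr == self.buf_addr:
            # in.addr = buf.addr: take the value from the register, not memory.
            value = self.buf_val
        else:
            # Different addresses: the read of in.addr and the delayed
            # write of buf.addr do not conflict.
            value = self.mem[addr]
            if self.buf_addr is not None:
                self.mem[self.buf_addr] = self.buf_val
        value += 1
        self.buf_addr, self.buf_val = addr, value
        return value

    def flush(self):
        """Write the buffered value back, e.g. before inspecting memory."""
        if self.buf_addr is not None:
            self.mem[self.buf_addr] = self.buf_val
            self.buf_addr = None

counter = DelayedWriteCounter(4)
results = [counter.increment(a) for a in (1, 1, 2, 1)]
counter.flush()
print(results, counter.mem)
```

Back-to-back increments of the same address never read a stale value from memory, which is the hazard the delayed write is meant to avoid.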

[Figure: pipeline stages Read, Cache (Hit?), Read DRAM, Output, shown for successive inputs] 26

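[Figure: processing split into a fast path and a slow path. Fast path: Read, Cache (Hit?), Output, with misses sent to the slow path. Slow path: takes requests from the fast path, reads DRAM, and returns results to the fast path.] 27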
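
The fast-path / slow-path split on the slide above can be illustrated with a small sequential model. This is a sketch under assumptions, not ClickNP code: cache hits are answered immediately, misses are queued for a slow path that reads a dictionary standing in for DRAM, and the function and variable names are hypothetical.

```python
from collections import deque

def process(keys, dram, cache):
    """Serve hits on the fast path; defer misses to a slow-path queue."""
    slow_queue = deque()
    outputs = []
    for key in keys:
        if key in cache:
            # Fast path: answer from the cache without waiting.
            outputs.append((key, cache[key], "fast"))
        else:
            # Miss: hand over to the slow path instead of stalling.
            slow_queue.append(key)
    while slow_queue:
        # Slow path: read "DRAM", fill the cache, return the result.
        key = slow_queue.popleft()
        cache[key] = dram[key]
        outputs.append((key, cache[key], "slow"))
    return outputs

dram = {"a": 1, "b": 2, "c": 3}
cache = {"a": 1}
outputs = process(["a", "b", "a", "c"], dram, cache)
print(outputs)
```

In this model, results for missed keys appear after later hits, so outputs can leave out of arrival order. Whether the real design reorders them is not stated on the slide.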

[Figure: fast-path Cache and Output stages running back to back for successive inputs, with one input diverted to the slow path (To slow, Read DRAM, To fast) before Output] 28

Nearly 100 elements; 20% re-factored from Click modular router
Cover packet parsing, checksum, tunnel encap/decap, crypto, hash tables, prefix matching, packet scheduling, rate limiting
Throughput: 200 Mpps / 100 Gbps
Mean delay: 0.19 us, max delay: 0.8 us
Mean LoC: 80, max LoC: 196

Element      Fmax (MHz)  Peak Throughput  Delay (cycles)  Resource LE %  Resource BRAM %
L4_Parser    221.9       113.6 Gbps       11              0.8%           0.2%
IPChecksum   226.8       116.1 Gbps       18              2.3%           1.3%
NVGRE_Encap  221.8       113.6 Gbps       9               1.5%           0.6%
AES_CTR      217.0       27.8 Gbps        70              4.0%           23.1%
SHA1         220.8       113.0 Gbps       105             7.9%           6.6%
CuckooHash   209.7       209.7 Mpps       38              2.0%           65.5%
HashTCAM     207.4       207.4 Mpps       48              18.7%          22.0%
LPM_Tree     221.8       221.8 Mpps       181             4.3%           13.2%
SRPrioQueue  214.5       214.5 Mpps       41              2.6%           0.6%
RateLimiter  141.5       141.5 Mpps       14              16.9%          14.1%
29
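
The peak-throughput figures above appear to follow from the clock frequency. For the Gbps rows, the listed value matches Fmax times 512 bits per cycle (128 bits for AES_CTR). For the Mpps rows, it equals Fmax, i.e. one packet per cycle. These per-cycle widths are not stated on the slide; they are an inference from the numbers, checked in the sketch below. The delay in microseconds is simply cycles divided by Fmax in MHz.

```python
# name: (Fmax in MHz, ASSUMED bits per cycle, throughput in Gbps as listed)
gbps_rows = {
    "L4_Parser":   (221.9, 512, 113.6),
    "IPChecksum":  (226.8, 512, 116.1),
    "NVGRE_Encap": (221.8, 512, 113.6),
    "AES_CTR":     (217.0, 128, 27.8),
    "SHA1":        (220.8, 512, 113.0),
}

def peak_gbps(fmax_mhz: float, bits_per_cycle: int) -> float:
    """Throughput when one word of the given width is processed per cycle."""
    return fmax_mhz * 1e6 * bits_per_cycle / 1e9

def delay_us(cycles: int, fmax_mhz: float) -> float:
    """Latency in microseconds of a pipeline `cycles` deep at Fmax."""
    return cycles / fmax_mhz

for name, (fmax, width, listed) in gbps_rows.items():
    print(f"{name:12s} computed {peak_gbps(fmax, width):6.1f} Gbps, listed {listed} Gbps")

# Deepest element in the table: LPM_Tree, 181 cycles at 221.8 MHz.
print(f"LPM_Tree delay: {delay_us(181, 221.8):.2f} us")
```

The LPM_Tree delay works out to about 0.8 us, in line with the maximum delay quoted on the slide.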

Network Function   Lines of Code *  Number of Elements  Resource LE %  Resource BRAM %
Pkt generator      13               6                   16%            12%
Pkt capture        12               11                  8%             5%
OpenFlow firewall  23               7                   32%            54%
IPSec gateway      37               10                  35%            74%
L4 load balancer   42               13                  36%            38%
pFabric scheduler  23               7                   11%            15%
33

scheduler pkt 1 pkt n 37

            ClickNP         StrongSwan / Linux (out of the box)
Throughput  37.8 Gbps       628 Mbps
Latency     13 us (stable)  50 us ~ 5 ms
39

Nexthop allocation CPU element 40

Resource utilization (Min / Max)

NetFPGA Function  LUTs          Registers     BRAMs
Input arbiter     2.1x / 3.4x   1.8x / 2.8x   0.9x / 1.3x
Output queue      1.4x / 2.0x   2.0x / 3.2x   0.9x / 1.2x
Header parser     0.9x / 3.2x   2.1x / 3.2x   N/A
Openflow table    0.9x / 1.6x   1.6x / 2.3x   1.1x / 1.2x
IP checksum       4.3x / 12.1x  9.7x / 32.5x  N/A
Encap             0.9x / 5.2x   1.1x / 10.3x  N/A
43

ClickNP 46

                   GPU   NP    FPGA
Throughput         High  High  High
Latency            High  Low   Low
Power              High  Low   Low
General computing  Yes   No    Yes
50

Define elements
Define a configuration of elements
Host manager program
Windows/Linux, Altera/Xilinx
53

A B C
Communicate by sharing memory
Shared memory is the bottleneck!
Batch processing has large latency!
54

A B C
"Do not communicate by sharing memory; instead, share memory by communicating." -- The slogan of the Go language
55
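
The slogan above describes elements that keep their state private and exchange data only through channels. As a rough software analogy (not ClickNP syntax), the sketch below chains three elements A, B, C with bounded FIFO queues, one thread per element. The helper names `element` and `run_chain` and the queue depth of 4 are assumptions.

```python
import queue
import threading

STOP = object()  # sentinel that tells each element the stream has ended

def element(fn, inp, out):
    """Read items from the input channel, process, write to the output channel."""
    while True:
        item = inp.get()
        if item is STOP:
            out.put(STOP)
            return
        out.put(fn(item))

def run_chain(items, fns):
    """Connect one element per function with FIFO channels and run them."""
    chans = [queue.Queue(maxsize=4) for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=element, args=(fn, chans[i], chans[i + 1]))
               for i, fn in enumerate(fns)]

    def feed():
        for item in items:
            chans[0].put(item)
        chans[0].put(STOP)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()
    results = []
    while (item := chans[-1].get()) is not STOP:
        results.append(item)
    for t in threads:
        t.join()
    return results

# Elements A, B, C: each touches only its own input and output channel.
results = run_chain(range(10), [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3])
print(results)
```

No element reads another element's variables, so there is no shared memory to contend for. The bounded queues provide back-pressure when a downstream element is slower.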

[Figure: pipeline stages Read key, Check key, Read counter, Increment, Write counter, with the operations of successive inputs (R1 C1 R2 I2 W2) overlapping in time] 56

[Figure: checksum loop (sum += pkt[i] while i<4) between Input and Output, redrawn as four chained Cksum stages: sum += pkt[0], sum += pkt[1], sum += pkt[2], sum += pkt[3]] 57
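
The slide above appears to show a four-iteration loop being unrolled into four separate additions. The sketch below only illustrates that the two forms compute the same result; in hardware the unrolled form lets each addition become its own stage. The function names are hypothetical.

```python
def cksum_rolled(pkt):
    """Loop form: one adder reused for four iterations."""
    s = 0
    i = 0
    while i < 4:
        s += pkt[i]
        i += 1
    return s

def cksum_unrolled(pkt):
    """Unrolled form: four additions with no loop counter or branch."""
    s = 0
    s += pkt[0]
    s += pkt[1]
    s += pkt[2]
    s += pkt[3]
    return s

pkt = [0x4500, 0x003C, 0x1C46, 0x4000]
print(hex(cksum_rolled(pkt)), hex(cksum_unrolled(pkt)))
```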

[Figure: pipeline stages Read, Cache (Hit?), Output, with misses handed to a slow element over a slow path] 58