Click! N P
[Figure: network topology. Node labels: C0, C1, L0-L3, T0-T3. Address labels: 10.3.0.1, 10.3.1.1, 10.2.0.1, 10.2.1.1, 10.2.2.1, 10.2.3.1. Port labels: Eth1-2, Eth3-4, Eth1-4, Eth1-24, Eth24-25, Eth33.]
Network function            Implementation          1500B pkt @ 40 Gbps   64B pkt @ 40 Gbps
                                                    (normal case)         (worst-case estimate)
NVGRE tunnel encapsulation  Hyper-V virtual switch  5                     100
Firewall (8K rules)         Linux iptables          21                    480
88'h68656c6c6f20776f726c64
Ahhhhhhhhhhhh!
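The hex digits on this slide are plain ASCII. As a small illustration (not part of the original talk), the sketch below decodes them and checks that the width prefix of 88 matches the payload size in bits. The variable names are made up for this example.

```python
# Hex digits copied from the slide; 88 is the bit width written before them.
HEX_DIGITS = "68656c6c6f20776f726c64"
DECLARED_BITS = 88

payload = bytes.fromhex(HEX_DIGITS)   # 11 raw bytes
text = payload.decode("ascii")        # interpret the bytes as ASCII text
bit_width = len(payload) * 8          # 8 bits per byte

print(text)       # hello world
print(bit_width)  # 88
```

In other words, the slide shows what a two-word string constant looks like when it has to be written as a sized hexadecimal literal.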
Click! N P
- Fully programmable using a high-level language
- Click abstractions familiar to software developers; easy code reuse
- High throughput; microsecond-scale latency
- FPGA is no panacea; fine-grained processing separation
[Figure: diagram with nodes A, B, C]
[Figure: element structure, annotated with software analogies: (reg/mem), (I/O), (main thread), (I/O), (ISR), (interrupt)]
[Figure: Element A (FPGA) and Element B (CPU), connected by a PCIe I/O channel]
Verilog code (.v)
[Figure: ClickNP architecture and toolchain.
Host: ClickNP host process (manager thread, worker thread, ClickNP elements), ClickNP library, PCIe I/O channel, vendor libs, ClickNP host manager, Catapult PCIe driver.
FPGA: ClickNP elements, Catapult shell, ClickNP vendor-specific runtime.
Toolchain: ClickNP script, ClickNP compiler, C compiler (Visual Studio / GCC), vendor HLS (Altera OpenCL / Vivado HLS). Cross-platform toolchain.]
[Figure: example with a Count element and a CPU logger. Panels: "Count element:" and "ClickNP Configuration:".]
[Figure: pipelined summation. Stages: Input pkt; s += pkt[0]; s += pkt[1]; s += pkt[2]; Output s. The Input / + / + / + / Output sequence is repeated for successive packets.]
[Figure: memory read/write dependency. Stages: Read input; Read mem; Increment; Write mem; Write out. Timeline of Read / Inc / Write sequences for successive inputs.]
Delayed write:
- Buffer new data in a register
- Delay memory write until next read
Memory read and write can operate in parallel: read in.addr, write buf.addr. Different memory addresses!
[Figure: pipeline before and after. Before: Read input; Read mem; Increment; Write mem; Write out. After: Read input; Read buf; in.addr = buf.addr?; Read mem and Write mem; Increment; Write buf; Write out.]
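A minimal software sketch of the delayed-write idea, assuming a simple per-address counter as the workload. The class and method names are hypothetical and only model the logic on the slide (buffer register, address comparison, deferred memory write); they do not model FPGA timing.

```python
class DelayedWriteCounter:
    """Counter table whose newest value lives in a one-entry buffer register."""

    def __init__(self, size):
        self.mem = [0] * size   # models the on-chip memory
        self.buf_addr = None    # address held in the buffer register
        self.buf_val = 0        # data held in the buffer register

    def increment(self, addr):
        if addr == self.buf_addr:
            # in.addr == buf.addr: take the value from the register,
            # so no memory access depends on the pending write.
            val = self.buf_val
        else:
            # Different addresses: the read of in.addr and the delayed
            # write of buf.addr touch different locations, so hardware
            # could issue them in the same cycle.
            val = self.mem[addr]
            if self.buf_addr is not None:
                self.mem[self.buf_addr] = self.buf_val
        val += 1
        self.buf_addr, self.buf_val = addr, val  # write buf
        return val                               # write out

    def flush(self):
        """Commit the buffered value (needed only at the end of a run)."""
        if self.buf_addr is not None:
            self.mem[self.buf_addr] = self.buf_val


counter = DelayedWriteCounter(4)
outputs = [counter.increment(a) for a in [1, 1, 2, 1]]
counter.flush()
```

The back-to-back accesses to address 1 are served from the register, which is the case that would otherwise force the read to wait for the previous write.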
[Figure: single pipeline with stages Read Cache; Hit?; Read DRAM; Output. Timeline of Cache / Read DRAM / Output for successive inputs.]
[Figure: pipeline split into a fast path (Read Cache; Hit?; Output) and a slow path (Read DRAM), linked by "To slow path" / "From fast path" and "To fast path" / "From slow path".]
[Figure: timeline after the split. Cache / Output pairs proceed back to back; one input goes To slow, Read DRAM, To fast, Output.]
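To make the benefit of the fast path / slow path split concrete, here is a rough cycle-count sketch. It assumes one cycle per cache access, a fixed DRAM latency, and a slow path that handles one miss at a time; all of these numbers and both function names are illustrative assumptions, not measurements from the talk.

```python
def blocking_cycles(hits, dram_latency):
    """Single pipeline: every miss stalls all inputs behind it."""
    return sum(1 + (0 if hit else dram_latency) for hit in hits)


def offloaded_cycles(hits, dram_latency):
    """Fast path accepts one input per cycle; misses go to a slow path."""
    slow_free_at = 0   # cycle at which the slow path becomes idle
    finish = 0         # cycle at which the last output is produced
    for cycle, hit in enumerate(hits, start=1):
        if hit:
            finish = max(finish, cycle)
        else:
            start = max(cycle, slow_free_at)      # wait if slow path is busy
            slow_free_at = start + dram_latency
            finish = max(finish, slow_free_at)
    return finish


trace = [True, False, True, True, True]   # one miss among four hits
before = blocking_cycles(trace, dram_latency=3)
after = offloaded_cycles(trace, dram_latency=3)
```

With this trace the hits that follow the miss complete while the DRAM read is still in flight, so the total drops from 8 cycles to 5 in this toy model.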
Nearly 100 elements
20% re-factored from Click modular router
Cover packet parsing, checksum, tunnel encap/decap, crypto, hash tables, prefix matching, packet scheduling, rate limiting
Throughput: 200 Mpps / 100 Gbps
Mean delay: 0.19 us, max delay: 0.8 us
Mean LoC: 80, max LoC: 196

Element      Fmax (MHz)  Peak Throughput  Delay (cycles)  Resource LE %  Resource BRAM %
L4_Parser    221.9       113.6 Gbps       11              0.8%           0.2%
IPChecksum   226.8       116.1 Gbps       18              2.3%           1.3%
NVGRE_Encap  221.8       113.6 Gbps       9               1.5%           0.6%
AES_CTR      217.0       27.8 Gbps        70              4.0%           23.1%
SHA1         220.8       113.0 Gbps       105             7.9%           6.6%
CuckooHash   209.7       209.7 Mpps       38              2.0%           65.5%
HashTCAM     207.4       207.4 Mpps       48              18.7%          22.0%
LPM_Tree     221.8       221.8 Mpps       181             4.3%           13.2%
SRPrioQueue  214.5       214.5 Mpps       41              2.6%           0.6%
RateLimiter  141.5       141.5 Mpps       14              16.9%          14.1%
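The throughput and delay columns are related to Fmax by simple arithmetic. The sketch below reproduces a few rows under an assumption that is inferred from the ratios in the table rather than stated on the slide: a 512-bit data path per cycle for the Gbps elements, 128 bits per cycle for AES_CTR, and one operation per cycle for the Mpps elements. The function names are hypothetical.

```python
def peak_gbps(fmax_mhz, bits_per_cycle):
    """Peak throughput if the element accepts bits_per_cycle every cycle."""
    return fmax_mhz * 1e6 * bits_per_cycle / 1e9


def delay_us(delay_cycles, fmax_mhz):
    """Latency in microseconds: cycles divided by clock frequency in MHz."""
    return delay_cycles / fmax_mhz


# Assumed widths (inferred from throughput / Fmax, not given on the slide).
l4_parser = peak_gbps(221.9, 512)   # about 113.6 Gbps
ip_checksum = peak_gbps(226.8, 512) # about 116.1 Gbps
aes_ctr = peak_gbps(217.0, 128)     # about 27.8 Gbps

# LPM_Tree has the longest delay in the table: 181 cycles at 221.8 MHz.
lpm_delay = delay_us(181, 221.8)    # about 0.82 us
```

The LPM_Tree figure is consistent with the "max delay: 0.8 us" line, and for the Mpps rows one operation per cycle makes the peak rate in Mpps equal to Fmax in MHz.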
Network Function   Lines of Code *  Number of Elements  Resource LE %  Resource BRAM %
Pkt generator      13               6                   16%            12%
Pkt capture        12               11                  8%             5%
OpenFlow firewall  23               7                   32%            54%
IPSec gateway      37               10                  35%            74%
L4 load balancer   42               13                  36%            38%
pfabric scheduler  23               7                   11%            15%
[Figure: scheduler with packets pkt 1 through pkt n]
            ClickNP         StrongSwan / Linux (out of the box)
Throughput  37.8 Gbps       628 Mbps
Latency     13 us (stable)  50 us ~ 5 ms
[Figure: nexthop allocation, CPU element]
NetFPGA Function  Resource Utilization (Min / Max)
                  LUTs          Registers     BRAMs
Input arbiter     2.1x / 3.4x   1.8x / 2.8x   0.9x / 1.3x
Output queue      1.4x / 2.0x   2.0x / 3.2x   0.9x / 1.2x
Header parser     0.9x / 3.2x   2.1x / 3.2x   N/A
OpenFlow table    0.9x / 1.6x   1.6x / 2.3x   1.1x / 1.2x
IP checksum       4.3x / 12.1x  9.7x / 32.5x  N/A
Encap             0.9x / 5.2x   1.1x / 10.3x  N/A
                   GPU   NP    FPGA
Throughput         High  High  High
Latency            High  Low   Low
Power              High  Low   Low
General computing  Yes   No    Yes
- Define elements
- Define a configuration of elements
- Host manager program
- Windows/Linux, Altera/Xilinx
[Figure: A, B, C]
Communicate by sharing memory
- Shared memory is the bottleneck!
- Batch processing has large latency!
[Figure: A, B, C]
"Do not communicate by sharing memory; instead, share memory by communicating."
-- The slogan of the Go language
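As a software analogy for elements that talk only through channels, the sketch below wires three stages A, B, C together with bounded queues. It is a toy model using Python threads, not ClickNP code; the names element, run_pipeline, STOP and the channel depth of 4 are assumptions made for this example.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the stream


def element(process, inp, out):
    """One element: read from its input channel, process, write downstream."""
    while True:
        item = inp.get()
        if item is STOP:
            out.put(STOP)  # propagate shutdown to the next element
            return
        out.put(process(item))


def run_pipeline(stages, items, depth=4):
    # Bounded channels between elements; the final channel is unbounded so
    # the caller can drain it after all elements have finished.
    chans = [queue.Queue(maxsize=depth) for _ in stages] + [queue.Queue()]
    threads = [
        threading.Thread(target=element, args=(fn, chans[i], chans[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        chans[0].put(item)
    chans[0].put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = chans[-1].get()
        if item is STOP:
            return results
        results.append(item)


# Elements A, B, C each own their state; only messages cross between them.
results = run_pipeline([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3], range(5))
```

No stage reads or writes another stage's variables, so there is no shared memory to contend for, and each item moves on as soon as it is processed instead of waiting for a batch.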
[Figure: pipeline for a keyed counter. Stages: Read key; Check key; Read counter; Increment; Write counter. Timeline labels: Read / Check / Read / Inc / Write, and R1, C1, R2, I2, W2.]
[Figure: checksum loop and its unrolled pipeline. Loop form: Input; sum += pkt[i] while i<4; Output. Unrolled form: Input; sum += pkt[0]; sum += pkt[1]; sum += pkt[2]; sum += pkt[3]; Output.]
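The effect of unrolling the loop into pipeline stages can be sketched with a small cycle-level model. It assumes each unrolled stage k adds pkt[k] in one cycle and hands its partial sum to the next stage; the function name and the register representation are hypothetical, and real checksum details such as carry folding are left out.

```python
def pipelined_sums(packets, n_words=4):
    """Simulate n_words unrolled stages; return (sums, cycles used)."""
    regs = [None] * n_words   # regs[k] = (pkt, partial sum after stage k)
    feed = iter(packets)
    sums, cycles = [], 0
    while len(sums) < len(packets):
        # Advance from the last stage backwards so each stage consumes the
        # value its predecessor produced in the previous cycle.
        for k in range(n_words - 1, 0, -1):
            prev = regs[k - 1]
            regs[k] = None if prev is None else (prev[0], prev[1] + prev[0][k])
        pkt = next(feed, None)                     # stage 0 takes a new packet
        regs[0] = None if pkt is None else (pkt, pkt[0])
        if regs[-1] is not None:                   # last stage emits a sum
            sums.append(regs[-1][1])
        cycles += 1
    return sums, cycles


packets = [[1, 2, 3, 4], [5, 6, 7, 8], [1, 1, 1, 1]]
sums, cycles = pipelined_sums(packets)
```

Three packets finish in 6 cycles here (3 + 4 - 1), whereas running the four-iteration loop to completion for each packet in turn would take 12 cycles in the same model.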
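[Figure: offloading a slow path. One version keeps Read; Read Cache; Hit?; Slow path; Output in a single pipeline. The other keeps Read; Read Cache; Hit?; Output together and sends misses "To slow element", where the slow path runs in a separate slow element.]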