L0 L1 L2 L3 T0 T1 T2 T3. Eth1-4. Eth1-4. Eth1-2 Eth1-2 Eth1-2 Eth Eth3-4 Eth3-4 Eth3-4 Eth3-4.

Cathleen Hunter
5 years ago

1 ClickNP

2 [Figure: network topology with nodes C0-C1, L0-L3 and T0-T3, connected through port groups Eth1-2, Eth3-4, Eth1-4, Eth1-24, Eth24-25 and Eth33] 2

3
Network function            Implementation          1500B, 40 Gbps (normal case)  …B, 40 Gbps (worst-case estimate)
NVGRE tunnel encapsulation  Hyper-V virtual switch  …                             …
Firewall (8K rules)         Linux iptables          …                             …

7 88'h68656c6c6f20776f726c64 Ahhhhhhhhhhhh! 7
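The string on this slide reads like a Verilog sized hexadecimal constant (written 88'h... in Verilog). As a small illustration under that assumption, the Python sketch below decodes the payload and checks that the declared width matches it. The helper name decode_verilog_hex is hypothetical and is not part of any ClickNP or Verilog tooling.

```python
def decode_verilog_hex(literal):
    """Split a sized hex literal such as 88'h6865... into (bit width, ASCII text)."""
    width_str, hex_digits = literal.split("'h")
    width = int(width_str)
    data = bytes.fromhex(hex_digits)
    # Each byte carries 8 bits, so the declared width should equal 8 * len(data).
    assert width == 8 * len(data), "declared width does not match payload"
    return width, data.decode("ascii")


width, text = decode_verilog_hex("88'h68656c6c6f20776f726c64")
print(width, text)  # 88 hello world
```

The eleven bytes spell an ordinary greeting, which is presumably the point of the slide: even a simple string is awkward to write in a hardware description language.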

8 ClickNP
Fully programmable using a high-level language
Click abstractions familiar to software developers; easy code reuse
High throughput; microsecond-scale latency
FPGA is no panacea; fine-grained processing separation

9 A B C 9

10 (reg/mem) (I/O) (main thread) (I/O) (ISR) (interrupt) 10

11 Element A (FPGA) Element B (CPU) PCIe I/O channel 11

12 Verilog code (.v) 12

13 [Figure: ClickNP architecture, Host and FPGA sides. Labels: ClickNP host process (Mgr thrd, Worker thrd); ClickNP elements; ClickNP library; PCIe I/O channel; vendor libs; ClickNP host mgr; ClickNP script; Catapult PCIe driver; Catapult shell; ClickNP vendor-specific runtime. Cross-platform toolchain: ClickNP compiler; C compiler (Visual Studio / GCC); vendor HLS (Altera OpenCL / Vivado HLS)] 13

19 CPU logger ClickNP Configuration: 19

20 Count element: CPU logger ClickNP Configuration: 20

23 [Figure: stages Input pkt, s += pkt[0], s += pkt[1], s += pkt[2], Output s, repeated across successive Input/Output rows] 23

24 [Figure: stages Read input, Read mem, Increment, Write mem, Write out, with a timeline of overlapping Read / Inc / Write operations] 24

25 [Figure: stages Read input, Read mem / Read buf (in.addr = buf.addr?), Increment, Write buf, Write mem, Write out]
Memory read and write can operate in parallel: read in.addr, write buf.addr (different memory addresses!).
Delayed write: buffer new data in a register; delay the memory write until the next read.
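The delayed-write idea on this slide can be sketched in software. The following Python model is only an illustration under assumptions: the class name DelayedWriteCounter, the single-entry buffer, and the flush method are hypothetical and are not taken from the ClickNP sources. Each increment holds its result in a register-like buffer; the buffered value is written back when the next read arrives, and a read to the same address is served from the buffer instead of memory.

```python
class DelayedWriteCounter:
    """Counter table that defers each memory write until the next read."""

    def __init__(self, size):
        self.mem = [0] * size
        self.buf_addr = None  # address held in the write-buffer register
        self.buf_val = 0      # value waiting to be written back

    def increment(self, addr):
        if addr == self.buf_addr:
            # in.addr == buf.addr: memory is stale, so forward from the buffer.
            old = self.buf_val
        else:
            # Different addresses: the read of in.addr and the delayed write
            # of buf.addr touch separate locations, so they could overlap.
            old = self.mem[addr]
            if self.buf_addr is not None:
                self.mem[self.buf_addr] = self.buf_val
        new = old + 1
        self.buf_addr, self.buf_val = addr, new  # buffer the new data
        return new

    def flush(self):
        """Write back whatever is still buffered (end of stream)."""
        if self.buf_addr is not None:
            self.mem[self.buf_addr] = self.buf_val
            self.buf_addr = None


counter = DelayedWriteCounter(4)
results = [counter.increment(a) for a in [1, 1, 2, 1]]
counter.flush()
print(results, counter.mem)  # [1, 2, 1, 3] [0, 3, 1, 0]
```

Back-to-back updates to address 1 never wait for memory here, because the second one reads the buffered value.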

26 [Figure: stages Read, Cache hit?, Cache / Read DRAM, Output, repeated for successive requests] 26

27 [Figure: fast path (Read, Cache, Hit?, To slow path, From slow path, Output) and slow path (From fast path, Read DRAM, To fast path)] 27

28 [Figure: timeline of Cache / Output stages, with one request sent to the slow path (Read DRAM) and returned to the fast path] 28
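The fast-path / slow-path split shown in the last three slides can be sketched as two elements joined by FIFO channels. This is a minimal Python model under assumptions, not ClickNP code: the function names, the use of deques as channels, and the out-of-order completion of misses are all choices made for the illustration.

```python
from collections import deque


def fast_path(keys, cache, to_slow, out):
    """Fast element: answer cache hits at once, hand misses to the slow element."""
    for key in keys:
        if key in cache:
            out.append((key, cache[key]))
        else:
            to_slow.append(key)


def slow_path(dram, to_slow, to_fast):
    """Slow element: serve queued misses from DRAM and send the data back."""
    while to_slow:
        key = to_slow.popleft()
        to_fast.append((key, dram[key]))


def fast_path_refill(cache, to_fast, out):
    """Fast element again: install returned entries in the cache and emit them."""
    while to_fast:
        key, value = to_fast.popleft()
        cache[key] = value
        out.append((key, value))


dram = {1: "a", 2: "b", 3: "c"}  # stand-in for off-chip memory
cache = {1: "a"}                   # only key 1 is cached at the start
to_slow, to_fast, out = deque(), deque(), []

fast_path([1, 2, 1, 3], cache, to_slow, out)
slow_path(dram, to_slow, to_fast)
fast_path_refill(cache, to_fast, out)
print(out)  # [(1, 'a'), (1, 'a'), (2, 'b'), (3, 'c')]
```

In this model the two hits for key 1 complete before either miss, so a slow DRAM access does not stall requests that hit in the cache.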

29 Nearly 100 elements; 20% re-factored from Click modular router
Cover packet parsing, checksum, tunnel encap/decap, crypto, hash tables, prefix matching, packet scheduling, rate limiting
Throughput: 200 Mpps / 100 Gbps
Mean delay: 0.19 us, max delay: 0.8 us
Mean LoC: 80, max LoC: 196
Element      Fmax (MHz)  Peak throughput  Delay (cycles)  Resource LE %  Resource BRAM %
L4_Parser    …           … Gbps           …               …              0.2%
IPChecksum   …           … Gbps           …               …              1.3%
NVGRE_Encap  …           … Gbps           9               1.5%           0.6%
AES_CTR      …           … Gbps           …               …              23.1%
SHA          …           … Gbps           …               …              6.6%
CuckooHash   …           … Mpps           …               …              65.5%
HashTCAM     …           … Mpps           …               …              22.0%
LPM_Tree     …           … Mpps           …               …              13.2%
SRPrioQueue  …           … Mpps           …               …              0.6%
RateLimiter  …           … Mpps           …               …              14.1%

33
Network function    Lines of code*  Number of elements  Resource LE %  Resource BRAM %
Pkt generator       …               …                   …              12%
Pkt capture         …               …                   …              5%
OpenFlow firewall   …               …                   …              54%
IPSec gateway       …               …                   …              74%
L4 load balancer    …               …                   …              38%
pFabric scheduler   …               …                   …              15%

37 scheduler pkt 1 pkt n 37

39
            ClickNP         StrongSwan / Linux (out of the box)
Throughput  37.8 Gbps       628 Mbps
Latency     13 us (stable)  50 us ~ 5 ms
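To put the figures on this slide side by side, the short Python calculation below derives the ratios between the two systems. It uses only the numbers shown on the slide; the variable names are illustrative.

```python
clicknp_gbps = 37.8
strongswan_gbps = 0.628  # 628 Mbps expressed in Gbps

clicknp_latency_us = 13.0
strongswan_latency_us = (50.0, 5000.0)  # 50 us to 5 ms

throughput_ratio = clicknp_gbps / strongswan_gbps
latency_ratio_best = strongswan_latency_us[0] / clicknp_latency_us
latency_ratio_worst = strongswan_latency_us[1] / clicknp_latency_us

print(f"throughput: {throughput_ratio:.1f}x")  # throughput: 60.2x
print(f"latency: {latency_ratio_best:.1f}x to {latency_ratio_worst:.1f}x")  # latency: 3.8x to 384.6x
```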

40 Nexthop allocation CPU element 40

43 Resource utilization (Min / Max)
NetFPGA function  LUTs          Registers     BRAMs
Input arbiter     2.1x / 3.4x   1.8x / 2.8x   0.9x / 1.3x
Output queue      1.4x / 2.0x   2.0x / 3.2x   0.9x / 1.2x
Header parser     0.9x / 3.2x   2.1x / 3.2x   N/A
OpenFlow table    0.9x / 1.6x   1.6x / 2.3x   1.1x / 1.2x
IP checksum       4.3x / 12.1x  9.7x / 32.5x  N/A
Encap             0.9x / 5.2x   1.1x / 10.3x  N/A

46 ClickNP 46

47 ClickNP

50
                   GPU   NP    FPGA
Throughput         High  High  High
Latency            High  Low   Low
Power              High  Low   Low
General computing  Yes   No    Yes

53 Define elements; define a configuration of elements; host manager program; Windows/Linux, Altera/Xilinx 53

54 [Figure: A, B, C] Communicate by sharing memory. Shared memory is the bottleneck! Batch processing has large latency! 54

55 [Figure: A, B, C] "Do not communicate by sharing memory; instead, share memory by communicating." -- the slogan of the Go language 55
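The message-passing style quoted here can be illustrated with a small pipeline in which three stages, standing in for A, B and C, keep no shared state and talk only through FIFO channels. This Python sketch is an assumption-laden model, not ClickNP code: the per-stage functions are arbitrary, and threads with queue.Queue merely stand in for hardware elements and channels.

```python
import queue
import threading

STOP = object()  # sentinel that marks the end of the stream


def element(fn, inp, out):
    """One element: no shared state, talks to its neighbours through channels."""
    while True:
        msg = inp.get()
        if msg is STOP:
            out.put(STOP)  # propagate shutdown downstream
            return
        out.put(fn(msg))


def run_pipeline(items):
    # Four FIFO channels: source -> A -> B -> C -> sink.
    chans = [queue.Queue() for _ in range(4)]
    stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]  # A, B, C (arbitrary)
    threads = [
        threading.Thread(target=element, args=(fn, chans[i], chans[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        chans[0].put(item)
    chans[0].put(STOP)
    results = []
    while True:
        msg = chans[-1].get()
        if msg is STOP:
            break
        results.append(msg)
    for t in threads:
        t.join()
    return results


print(run_pipeline([1, 2, 3]))  # [1, 3, 5]
```

Because each stage owns its data and each channel is a FIFO, no locks are needed and the output order matches the input order.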

56 [Figure: stages Read key, Check key, Read counter, Increment, Write counter, with a timeline of overlapping operations (R1 C1 R2 I2 W2)] 56

57 [Figure: Cksum element between Input and Output, shown as a loop (sum += pkt[i], i<4) and as separate stages sum += pkt[0], sum += pkt[1], sum += pkt[2], sum += pkt[3]] 57
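The two forms of the checksum sum that appear on this slide, a loop over i<4 and four separate additions, can be written out as follows. This is a plain Python sketch for comparison only; the function names are hypothetical, and the fixed length of four words is taken from the slide's i<4 condition.

```python
def cksum_loop(pkt):
    """Rolled form: one adder reused for four iterations."""
    total = 0
    i = 0
    while i < 4:
        total += pkt[i]
        i += 1
    return total


def cksum_unrolled(pkt):
    """Unrolled form: one addition per step, so each could be its own stage."""
    total = 0
    total += pkt[0]
    total += pkt[1]
    total += pkt[2]
    total += pkt[3]
    return total


words = [0x4500, 0x0034, 0x1C46, 0x4000]
print(cksum_loop(words) == cksum_unrolled(words))  # True
```

Both forms compute the same value. The unrolled one has no loop counter and no branch, which is what allows a hardware pipeline to accept a new packet while earlier ones are still being summed.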

58 [Figure: stages Read, Cache hit?, Slow path, Output, and a variant in which the slow path is handed to a separate slow element] 58
