Recent Advances in Software Router Technologies


1 Recent Advances in Software Router Technologies
KRNET COEX
Sue Moon
In collaboration with: Sangjin Han (1), Seungyeop Han (2), Seonggu Huh (3), Keon Jang (4), Joongi Kim, KyoungSoo Park (5)
Advanced Networking Lab, CS, KAIST
(1) UCB, (2) UW, (3) UT Austin, (4) Microsoft Research Cambridge, (5) Networked and Distributed Computing Systems Lab, EE, KAIST

2 Overview
- Performance upper bound of software routers
- Cost benefits of commodity-hardware-based platforms
- Battleground of H/W vs. S/W routers
- Recent advances in software router technologies
  - 60 Gbps out of a single server-grade PC
  - Design issues of putting Click on nshader

3 Performance Upper Bound of Software Routers
Demand for computing cycles
- 10 Gbps = 84B frames x 14.9 Mpps
- (payload)+4+12B frames 88B frames (or 64B packets) x 14.2 Mpps
- Packet inter-arrival time: 67.2 ns = 135 cycles at a 2 GHz clock = 201 cycles at a 3 GHz clock
Capacity from commodity hardware
- Intel Xeon E GHz 8-core processor
- 2.6 G x 8 / 14.2 Mpps = 1.46K cycles/packet
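The arithmetic on this slide can be reproduced with a short sketch. The function names are illustrative only; the 84B and 88B on-wire sizes, the 2.6 GHz clock, and the 8 cores are taken from the slide as given.

```python
def packets_per_second(link_bps, wire_bytes):
    """Packets per second when each packet occupies wire_bytes on the link."""
    return link_bps / (wire_bytes * 8)


def cycles_per_packet(clock_hz, cores, pps):
    """CPU cycles available per packet, summed over all cores."""
    return clock_hz * cores / pps


LINK_BPS = 10e9  # 10 Gbps link

pps_84 = packets_per_second(LINK_BPS, 84)   # 84B on the wire per packet
pps_88 = packets_per_second(LINK_BPS, 88)   # 88B on the wire per packet
interarrival_ns = 1e9 / pps_84              # time between packets at line rate
budget = cycles_per_packet(2.6e9, 8, 14.2e6)

print(f"84B frames: {pps_84 / 1e6:.2f} Mpps, one packet every {interarrival_ns:.1f} ns")
print(f"88B frames: {pps_88 / 1e6:.2f} Mpps")
print(f"Cycles between packets: {interarrival_ns * 2:.1f} at 2 GHz, {interarrival_ns * 3:.1f} at 3 GHz")
print(f"Budget on 8 cores at 2.6 GHz: {budget:.0f} cycles/packet")
```

The per-core window between packets is only a couple of hundred cycles, which is why the budget is stated across all eight cores.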

4 Per-Packet CPU Cycles
Cycles needed
- IPv4: 1,200 (packet I/O) + 600 (IPv4 lookup) = 1,800 cycles
- IPv6: 1,200 (packet I/O) + 1,600 (IPv6 lookup) = 2,800 cycles
- IPsec: 1,200 (packet I/O) + 5,400 (encryption and hashing) = 6,600 cycles
Optimization techniques (~200 cycles): PF_RING (DNA) [PF_RING], psio [Han10], netmap [Rizzo12], Intel DPDK [Intel12]


5 Per-Packet CPU Cycles
Cycles needed
- IPv4: 1,200 (packet I/O) + 600 (IPv4 lookup) = 1,800 cycles
- IPv6: 1,200 (packet I/O) + 1,600 (IPv6 lookup) = 2,800 cycles
- IPsec: 1,200 (packet I/O) + 5,400 (encryption and hashing) = 6,600 cycles
Q: Where to get more cycles?
- Network processor
- FPGA
- GPU
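To see why the question arises, the per-packet costs above can be set against the 8-core, 2.6 GHz server from slide 3. This is a rough sketch: it assumes perfect scaling across cores and no other overhead, and the names are made up for illustration.

```python
PACKET_IO_CYCLES = 1200  # per-packet I/O cost from the slide
APP_CYCLES = {"IPv4": 600, "IPv6": 1600, "IPsec": 5400}

TOTAL_HZ = 2.6e9 * 8     # 8 cores at 2.6 GHz, as on slide 3
LINE_RATE_MPPS = 14.2    # 10 Gbps with 88B frames, as on slide 3


def total_cycles(app):
    """Packet I/O plus application cycles for one packet."""
    return PACKET_IO_CYCLES + APP_CYCLES[app]


def max_mpps(app):
    """Upper bound on packet rate if every cycle goes to this application."""
    return TOTAL_HZ / total_cycles(app) / 1e6


for app in APP_CYCLES:
    rate = max_mpps(app)
    print(f"{app}: {total_cycles(app)} cycles/packet, "
          f"at most {rate:.2f} Mpps ({rate / LINE_RATE_MPPS:.0%} of one 10G port)")
```

Under these assumptions none of the three workloads reaches line rate on even a single 10G port with minimum-size packets.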

6 Network Processors
- Intel IXP (circa 2007): 400 MHz ~ 667 MHz w/ DDR2
- Cavium's Octeon III CN MIPS cores x 1.6 GHz = 6.4G cycles (x 2~3 from pipelining)
- PMC WP3SL WinPath3: 2 MIPS cores x 650 MHz = 1.3G cycles, ~15 Mpps
- Broadcom 100 Gbps NPU: 64 cores x 1 GHz = 64G cycles
- Intel GE controller

7 FPGA (full-board solutions)
NetFPGA-10G
- Supports 4x 10GE SFP+ interfaces
- Xilinx Virtex-5 TX240T FPGA
- x8 PCIe Gen2 (5 Gbps/lane)
- $1,675
Tilera TILEncore Card
- TILEPro64 processor = 8x8 grid of general-purpose processor cores
- 443 BOPS
- 37 Tbps on-chip mesh interconnect
- 200 Gbps memory b/w
- Dual 10GE and dual 1GE

8 GPU
NVIDIA GeForce GTX480
- # CUDA cores: 448 (@1215 MHz / 1345 GFLOPS)
- Memory: 1280 MB GDDR5 (320-bit, GB/s)
- Interface: PCIe 2.0 (x16)
- TDP: 250 W
- Price: $499
NVIDIA GeForce GTX680
- # CUDA cores: 1536 (@~1058 MHz / GFLOPS)
- Memory: 2048 MB GDDR5 (256-bit, GB/s)
- Interface: PCIe 3.0 (x16)
- TDP: 195 W
- Price: $500

9 Why Do We Need Flexibility?
Functions that today's routers have
- IPv4/IPv6/multicast/MPLS/BGP
Functions that today's middleboxes perform
- IDS/IPS/firewall
- Load balancing
- VPNs
- Application gateways
- Proxies
- WAN optimizer
Functions people test for tomorrow
- OpenFlow switches
- SDN

10 Battleground of H/W vs S/W routers
Line-up of Cisco routers
Type | Model | Aggregate capacity | Notes
Core router | CRS Multishelf System | Up to 322 Tbps | CRS 16-Slot = 12.8 Tbps
Edge router | ASR 9000 Series | Up to 96 Tbps | 1GE/10GE/100GE I/F; ASR 9922 has 22 slots
Aggregation switch | Catalyst 6500 Series Switch | Up to 4 Tbps | 45,000 customers worldwide
S/W routers won't be able to compete in the foreseeable future

11 Programmable Switches
Juniper EX9200 Family
- Up to 240 Gbps/slot
- Custom ASIC design, Junos SDK + Puppet/OpenFlow support
Intel Ethernet Switch FM6000 Series
- Up to 64 x 10 GbE ports
- Hard-wired OpenFlow matching (multi-stage TCAM, 64K next-hop table)
Arista 7124FX
- 24x 10 GbE ports
- FPGA (160 Gbps capacity), Linux tool chain support
Nallatech PCIe-385N V-FPGA Computing Card
- 2x 10 GbE ports
- Supports OpenCL programming

12 Recent Advances in Software Router Technologies
- RouteBricks [SOSP09]: Click + horizontal scaling
- PacketShader [Han10]: psio + GPU-optimized packet processing
- CoMb [NSDI12]
- GPU-Click [OSDI12]
- DoubleClick [APSys12]
- nshader [work-in-progress]

13 RouteBricks: Individual Node

14 RouteBricks: System Configuration

15 Parallelism in Packet Processing
The key insight: stateless packet processing = parallelizable
RX queue
1. Batching
2. Parallel processing in GPU

16 Batching: Long Latency?
- Fast link = enough # of packets in a small time window
- 10 GbE link: up to 1,000 packets in only 67 μs
- Much less time with 40 or 100 GbE
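The 67 μs figure follows from the line-rate arithmetic on slide 3. A minimal sketch, assuming every packet occupies 84B on the wire (the minimum-size case from slide 3) and that the link is fully loaded; the function name is illustrative.

```python
def batch_fill_time_us(batch_size, link_bps, wire_bytes=84):
    """Microseconds needed for batch_size packets to arrive at line rate.

    wire_bytes=84 is the on-wire size of a minimum-size frame used on slide 3;
    larger packets would take proportionally longer to fill a batch.
    """
    return batch_size * wire_bytes * 8 / link_bps * 1e6


for gbps in (10, 40, 100):
    t = batch_fill_time_us(1000, gbps * 1e9)
    print(f"{gbps} GbE: 1,000 packets arrive within {t:.2f} us")
```

The wait for a full batch shrinks in proportion to link speed, so the added latency of batching becomes less of a concern on faster links.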

17 PacketShader: Scaling with Multiple Multi-Core CPUs
(Figure labels: Device driver, Preshader, Shader, Postshader)

18 PacketShader Hardware Setup (I)
- CPU: total 8 CPU cores; Intel Xeon E5500 (Nehalem), 4-core, 2.66 GHz
- NIC: total 80 Gbps; Intel X520-DA2 (82599EB), dual-port 10 GbE
- GPU: total 960 cores; NVIDIA GTX cores, 1.4 GHz

19 Results (w/ 64B packets) from Setup (I)
(Figure: throughput in Gbps, CPU-only vs. CPU+GPU, for IPv4, IPv6, OpenFlow, and IPsec)
GPU speedup: IPv4 1.4x, IPv6 4.8x, OpenFlow 2.1x, IPsec 3.5x

20 PacketShader Hardware Setup (II)
- CPU: total 16 CPU cores; Intel Xeon E (SandyBridge), 8-core, 2.6 GHz
- NIC: total 80 Gbps; Intel X520-DA2 (82599EB), dual-port 10 GbE
- GPU: total 3072 cores; NVIDIA GTX cores, ~1 GHz

21 Results (w/ 64B packets) from Setup (II)
(Figure: CPU-only vs. CPU+GPU throughput for IPv4, IPv6, and IPsec; baseline: L2 forwarding)
GPU speedup: IPv4 1.02x, IPv6 2.0x, IPsec 1.47x


22 nshader: Framework Architecture Overview
Pipeline: packet RX & preproc -> Input Dispatchers -> task input queues -> CPU Engine(s) / GPU Engine(s) -> task output queues -> Output Dispatchers -> postproc & packet TX
- Input Dispatchers wrap packets into tasks and assign an engine to each task using a load-balancing algorithm.
- Engines do the actual processing or interact with other types of processors (GPUs).
- Output Dispatchers aggregate results and transmit.
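The dispatcher/engine split can be illustrated with a small single-threaded sketch. This is not nshader code: every class and method name here is hypothetical, the engines are plain Python callables standing in for CPU or GPU workers, and "pick the engine with the shortest input queue" is an assumed load-balancing policy, since the slide does not say which algorithm is used.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class Task:
    """A batch of packets plus the results an engine produces for them."""
    packets: List[bytes]
    results: List[bytes] = field(default_factory=list)


class Engine:
    """Stand-in for a CPU or GPU engine with its own task input queue."""

    def __init__(self, name: str, process: Callable[[bytes], bytes]):
        self.name = name
        self.process = process
        self.input_queue: Deque[Task] = deque()

    def run(self, output_queue: Deque[Task]) -> None:
        # Process every queued task and hand it to the task output queue.
        while self.input_queue:
            task = self.input_queue.popleft()
            task.results = [self.process(p) for p in task.packets]
            output_queue.append(task)


class InputDispatcher:
    """Wraps packets into tasks and assigns each task to an engine."""

    def __init__(self, engines: List[Engine], batch_size: int):
        self.engines = engines
        self.batch_size = batch_size

    def dispatch(self, packets: List[bytes]) -> None:
        for i in range(0, len(packets), self.batch_size):
            task = Task(packets[i:i + self.batch_size])
            # Assumed policy: choose the engine with the shortest queue.
            engine = min(self.engines, key=lambda e: len(e.input_queue))
            engine.input_queue.append(task)


class OutputDispatcher:
    """Aggregates finished tasks; 'transmit' here just collects results."""

    def __init__(self) -> None:
        self.transmitted: List[bytes] = []

    def drain(self, output_queue: Deque[Task]) -> None:
        while output_queue:
            self.transmitted.extend(output_queue.popleft().results)


engines = [Engine("cpu", bytes.upper), Engine("gpu", bytes.upper)]
output_queue: Deque[Task] = deque()
packets = [b"p0", b"p1", b"p2", b"p3", b"p4"]

InputDispatcher(engines, batch_size=2).dispatch(packets)
for engine in engines:
    engine.run(output_queue)
out = OutputDispatcher()
out.drain(output_queue)
print(out.transmitted)
```

Note that this sketch does not preserve packet order across engines; results come out grouped by whichever engine ran first.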

23 Why Should We Invest in Software Routers?
- Nurture early adopters of commodity hardware for network processing
- Build barebone middleboxes capable of
  - Fast packet I/O & computation
  - Batching
  - HW-assisted multi-queue support/utilization
- Develop cost-effective solutions
  - SSL accelerator
  - IDS/Firewall/Proxy
  - WAN optimizer
- Increasing computational demand of the Internet
  - 1G-for-every-home adds burden of middleboxes

24 Rise of Middleboxes
Across all network sizes, the number of middleboxes is on par with the number of routers in a network! [Sherry12]

25 Q & A

26 References
[Han10] S. Han et al., "PacketShader: a GPU-accelerated software router," SIGCOMM 2010
[Intel12] Intel Data Plane Development Kit
[PF_RING]
[Rizzo12] L. Rizzo, "netmap: a novel framework for fast packet I/O," USENIX ATC 2012
[Sherry12] J. Sherry et al., "Making middleboxes someone else's problem: network processing as a cloud service," SIGCOMM 2012

27 BACKUP SLIDES

28 System Price Comparison
SSL accelerator
- Cisco AIM-VPN/SSL-3 plugin module: ~$2,550 / ~0.2 Gbps
- Cisco Catalyst 6500 SSL module: ~$30,000 / ~8-80 Gbps*
- SSL Shader (X x GTX580): < $10,000 / ~13 Gbps**
IDS (Intrusion Detection System)
- Cisco Catalyst 6500 IDSM-2 module: ~$30,000 / ~0.6 Gbps (passive)
- Kargus (X x GTX580): < $10,000 / ~30 Gbps**
*: up to 10 modules in a single chassis
**: conservative estimate of price
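One way to read this comparison is as cost per Gbps. The sketch below uses the approximate figures from the slide and makes two assumptions that the slide does not state outright: the ~$30,000 Catalyst SSL price is taken to buy a single ~8 Gbps module, and the "< $10,000" entries are treated as $10,000, so their results are upper bounds.

```python
# (name, approximate price in USD, approximate throughput in Gbps)
SYSTEMS = [
    ("Cisco AIM-VPN/SSL-3", 2550, 0.2),
    ("Cisco Catalyst 6500 SSL module", 30000, 8),   # assumed: one module
    ("SSL Shader", 10000, 13),                       # price is an upper bound
    ("Cisco Catalyst 6500 IDSM-2", 30000, 0.6),
    ("Kargus", 10000, 30),                           # price is an upper bound
]


def dollars_per_gbps(price_usd, gbps):
    """Approximate cost of one Gbps of capacity."""
    return price_usd / gbps


for name, price, gbps in SYSTEMS:
    print(f"{name}: ~${dollars_per_gbps(price, gbps):,.0f} per Gbps")
```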

29 Per-Packet CPU Cycles
Cycles needed
- IPv4: 1,200 (packet I/O) + 600 (IPv4 lookup) = 1,800 cycles
- IPv6: 1,200 (packet I/O) + 1,600 (IPv6 lookup) = 2,800 cycles
- IPsec: 1,200 (packet I/O) + 5,400 (encryption and hashing) = 6,600 cycles
Optimized packet I/O
- psio: ~200 cycles [Han10]
- netmap: ~200 cycles [Rizzo12]
- Intel DPDK: ~200 cycles [Intel12]

30 Our Approach 1: I/O Optimization
- IPv4: 1,200 (packet I/O) + 600 (IPv4 lookup) = 1,800 cycles
- IPv6: 1,200 (packet I/O) + 1,600 (IPv6 lookup) = 2,800 cycles
- IPsec: 1,200 (packet I/O) + 5,400 (encryption and hashing) = 6,600 cycles
Packet I/O: 1,200 reduced to 200 cycles per packet
Main ideas
- Huge packet buffer
- Batch processing
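The effect of cutting packet I/O from 1,200 to 200 cycles can be worked through per application. This is a sketch using only the numbers on this slide; the resulting totals and speedups are derived here and do not appear in the deck.

```python
APP_CYCLES = {"IPv4": 600, "IPv6": 1600, "IPsec": 5400}


def per_packet_cycles(io_cycles):
    """Total cycles per packet for each application at a given I/O cost."""
    return {app: io_cycles + cycles for app, cycles in APP_CYCLES.items()}


before = per_packet_cycles(1200)  # unoptimized packet I/O
after = per_packet_cycles(200)    # optimized packet I/O
speedup = {app: before[app] / after[app] for app in APP_CYCLES}

for app in APP_CYCLES:
    print(f"{app}: {before[app]} -> {after[app]} cycles ({speedup[app]:.2f}x)")
```

I/O optimization helps light workloads such as IPv4 forwarding the most; for IPsec the remaining cost is dominated by encryption and hashing, which is where offloading to other processors comes in.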

31 Silicon Budget in CPU and GPU
(Figure label: ALU)
- Xeon X5550: 4 cores, 731M transistors
- GTX480: 480 cores, 3,200M transistors


33 GPU
