PacketShader as a Future Internet Platform AsiaFI Summer School 2011.8.11. Sue Moon in collaboration with: Joongi Kim, Seonggu Huh, Sangjin Han, Keon Jang, KyoungSoo Park Advanced Networking Lab, CS, KAIST Networked and Distributed Computing Systems Lab, EE, KAIST
Network Systems
Routers are the last bastion of niche supercomputers
- Cisco CRS-1 4-slot 320 Gbps chassis = $15K on eBay
- Cisco's CRS-3 Multishelf System = 322 Tbps = $15K x 10^3 + α
- Cost = special-purpose ASIC chips + TCAMs + S/W suites
Why? People want Huge Freaking Routers!
Why do people go for large routers?
- Management O/H
- Routing protocol scalability
Hurdles for innovation
- Special-purpose ASICs, network processors
- Proprietary S/W (Cisco's IOS, Juniper's JunOS, Huawei's ???)
- Cost
2
Recent Trends
New functions keep being added
- Firewalls / IDS / IPS
- Monitoring and measurement
- VPN gears
- VoIP
- Traffic engineering
Switches with minimum L3 functions at the network edge
3
Next Generation Networking vs. Future Internet
Evolution vs. revolution
- Today's problems attacked with hacks and patches, vs. architectural re-design to attack the problems at the roots
Both approaches need to address ever-increasing traffic demand:
- 2010: 1.8B Internet users, 4.6B mobile subscribers, 1.4B connected machines, 800 x 10^18 bytes of info
- 2020: 5B Internet users, 10B mobile subscribers, 40B connected machines, 53 x 10^21 bytes of info
Notable breakthrough: a new protocol that could replace IP
4
Challenges Ahead
Design and demonstrate an overarching architecture
- Routing + security + mobility
NSF Future Internet Architecture (FIA) projects
- XIA, NDN, MobilityFirst, Nebula
NSF GENI projects
- Testbed building
- Mostly OpenFlow driven
Need a platform to play with
- Fast but cost-effective
- Software routers!
5
Positioning of 10+ Gbps Routers
Commercial competitors?
- Core routers with 100+ Tbps capacity? No.
- Edge routers with 100+ Gbps capacity and complex features? Maybe.
Experimental platforms?
- NetFPGA
- ServerSwitch
- RouteBricks
- ATCA-based boxes
6
PacketShader: A GPU-Accelerated Software Router High-performance Our prototype: 40 Gbps on a single box 7
Software Router
Despite its name, not limited to IP routing
- You can implement whatever you want on it
Driven by software
- Flexible
- Friendly development environments
Based on commodity hardware
- Cheap
- Fast evolution
8
Now 10 Gigabit NICs Are a Commodity
- From $200 to $300 per port
- A great opportunity for software routers
9
Achilles Heel of Software Routers
Low performance, due to the CPU bottleneck:
  Year  Ref.                           H/W                           IPv4 Throughput
  2008  Egi et al.                     Two quad-core CPUs            3.5 Gbps
  2008  Enhanced SR (Bolla et al.)     Two quad-core CPUs            4.2 Gbps
  2009  RouteBricks (Dobrescu et al.)  Two quad-core CPUs (2.8 GHz)  8.7 Gbps
Not capable of supporting even a single 10G port
10
What is PacketShader?
A high-speed software router
- Presented at SIGCOMM 2010
- 40 Gbps on a single box
Main ideas
- Optimizing packet I/O
- Utilizing the GPU for packet processing
Four applications
- IPv4 forwarding, IPv6 forwarding, OpenFlow switch, IPsec gateway
11
WHAT IS GPU? 12
GPU = Graphics Processing Unit The heart of graphics cards Mainly used for real-time 3D game rendering Massively-parallel processing capacity (Ubisoft's Avatar, from http://ubi.com) 13
CPU vs. GPU CPU: Small # of super-fast cores GPU: Large # of small cores 14
Silicon Budget in CPU and GPU
- Xeon X5550: 4 cores, 731M transistors
- GTX480: 480 cores, 3,200M transistors
(Figure: die area devoted to ALUs in each)
15
CPU BOTTLENECK 16
Per-Packet CPU Cycles for 10G
Cycles needed:
- IPv4:  1,200 (packet I/O) +   600 (IPv4 lookup)            = 1,800 cycles
- IPv6:  1,200 (packet I/O) + 1,600 (IPv6 lookup)            = 2,800 cycles
- IPsec: 1,200 (packet I/O) + 5,400 (encryption and hashing) = 6,600 cycles
Your budget: 1,400 cycles
(10G, min-sized packets, dual quad-core 2.66 GHz CPUs; x86 cycle numbers are from RouteBricks [Dobrescu09] and ours)
17
PacketShader Part 1: I/O Optimization
- IPv4:  1,200 (packet I/O) +   600 (IPv4 lookup)            = 1,800 cycles
- IPv6:  1,200 (packet I/O) + 1,600 (IPv6 lookup)            = 2,800 cycles
- IPsec: 1,200 (packet I/O) + 5,400 (encryption and hashing) = 6,600 cycles
Packet I/O: 1,200 reduced to 200 cycles per packet
Main ideas
- Huge packet buffer
- Batch processing
18
PacketShader Part 2: GPU Offloading
- IPv4:  packet I/O +   600 (IPv4 lookup)
- IPv6:  packet I/O + 1,600 (IPv6 lookup)
- IPsec: packet I/O + 5,400 (encryption and hashing)
GPU offloading for memory-intensive or compute-intensive operations
19
OPTIMIZING PACKET I/O ENGINE 20
User-space Packet Processing
Packet processing in the kernel is bad
- The kernel has higher scheduling priority; an overloaded kernel may starve user-level processes
- Some CPU extensions such as MMX and SSE are not available
- Buggy kernel code causes irreversible damage to the system
Processing in user-space is good
- Rich, friendly development and debugging environment
- Seamless integration with 3rd-party libraries such as CUDA or OpenSSL
- Easy to develop a virtualized data plane
But packet processing in user-space is known to be 3x slower!
Our solution: (1) batching + (2) better core-queue mapping
21
Inefficiencies of Linux Network Stack
Our remedies:
- Compact metadata
- Batch processing
- Huge packet buffer
- Software prefetch
(Figure: CPU cycle breakdown in packet RX)
Huge Packet Buffer
Eliminates the per-packet buffer allocation cost
(Figure: Linux per-packet buffer allocation vs. our huge packet buffer scheme)
23
GPU FOR PACKET PROCESSING 24
Advantages of GPU for Packet Processing
1. Raw computation power
2. Memory access latency
3. Memory bandwidth
Comparison between the Intel X5550 CPU and the NVIDIA GTX480 GPU
25
(1/3) Raw Computation Power
Compute-intensive operations in software routers
- Hashing, encryption, pattern matching, network coding, compression, etc.
- GPU can help!
Instructions/sec:
- CPU: 2.66 (GHz) x 4 (# of cores) x 4 (4-way superscalar) = 43 x 10^9
- GPU: 1.4 (GHz) x 480 (# of cores) = 672 x 10^9
26
(2/3) Memory Access Latency
Software routers suffer lots of cache misses
GPU can effectively hide memory latency
(Figure: on a cache miss, a GPU core switches to another thread instead of stalling)
27
(3/3) Memory Bandwidth CPU's memory bandwidth (theoretical): 32 GB/s 28
(3/3) Memory Bandwidth
Each forwarded packet crosses memory four times:
1. RX: NIC -> RAM
2. RX: RAM -> CPU
3. TX: CPU -> RAM
4. TX: RAM -> NIC
CPU's memory bandwidth (empirical): < 25 GB/s
29
(3/3) Memory Bandwidth
Your budget for packet processing can be less than 10 GB/s
GPU's memory bandwidth: 174 GB/s
31
PACKETSHADER DESIGN 32
Scaling with Multiple Multi-Core CPUs
(Figure: per-CPU pipeline of pre-shader, shader, and post-shader stages on top of the device driver)
33
Results (w/ 64B packets)
Throughput in Gbps, CPU-only vs. CPU+GPU (GPU speedup):
- IPv4:     28.2 -> 39.2 (1.4x)
- IPv6:      8.0 -> 38.2 (4.8x)
- OpenFlow: 15.6 -> 32.0 (2.1x)
- IPsec:     3.0 -> 10.2 (3.5x)
34
Example 1: IPv6 forwarding
Longest prefix matching on 128-bit IPv6 addresses
Algorithm: binary search on hash tables [Waldvogel97]
- 7 hashings + 7 memory accesses
(Figure: binary search over prefix lengths 1 to 128, probing lengths such as 64, 80, 96)
35
Example 1: IPv6 forwarding
(Figure: throughput in Gbps vs. packet size from 64 to 1514 bytes, CPU-only vs. CPU+GPU; routing table randomly generated with 200K entries)
Larger packet sizes are bounded by motherboard I/O capacity
36
Results
  Year  Ref.                           H/W                            IPv4 Throughput
  2008  Egi et al.                     Two quad-core CPUs              3.5 Gbps (kernel)
  2008  Enhanced SR (Bolla et al.)     Two quad-core CPUs              4.2 Gbps (kernel)
  2009  RouteBricks (Dobrescu et al.)  Two quad-core CPUs (2.8 GHz)    8.7 Gbps (kernel)
  2010  PacketShader (CPU-only)        Two quad-core CPUs (2.66 GHz)  28.2 Gbps (user)
  2010  PacketShader (CPU+GPU)         Two quad-core CPUs + two GPUs  39.2 Gbps (user)
37
PacketShader 1.0
GPU: a great opportunity for fast packet processing
PacketShader
- Optimized packet I/O + GPU acceleration
- Scalable with the # of multi-core CPUs, GPUs, and high-speed NICs
Current prototype
- Supports IPv4, IPv6, OpenFlow, and IPsec
- 40 Gbps performance on a single PC
38
Next Steps? 39
PacketShader 2.0
Control plane integration
- Dynamic routing protocols with Quagga or XORP
Opportunistic offloading
- CPU at low load, GPU at high load
Multi-functional, modular programming environment
- Integration with Click? [Kohler99]
40
Multi-functional, modular programming environment
(Figure: proposed dataflow: NICs -> receive -> chunks split into subchunks by filters -> gathering queues -> schedulable tasks dispatched to modules 1..N -> send to NICs, with a drop path and a slow path, e.g. the Linux TCP/IP stack)
Factors behind PacketShader 1.0 Performance
Batching, batching, batching!
- The I/O engine (modified NIC driver) uses contiguous huge packet buffers called chunks
- The user-level process pipelines multiple chunks
- The GPU processes multiple chunks in parallel
Hardware-aware optimizations
- No NUMA node crossing
- Minimized cache conflicts among multi-cores
Remaining Challenges
- 100+ Gbps speed
- Stateful processing: intrusion detection systems / firewalls
43
Review of I/O Capacity for 100+ Gbps
QuickPath Interconnect (QPI)
- CPU socket-to-socket link for remote memory
- IOH-to-IOH link for I/O traffic
- CPU-to-IOH link for CPU-to-peripheral connections
Today's QPI link: 12.8 GB/s, or 102.4 Gbps
Sandy Bridge
- On recall at the moment
- Expected to boost performance to 60 Gbps w/o modification
Review of Memory B/W for 100+ Gbps
For 100 Gbps forwarding we need 400 Gbps of memory bandwidth, plus bookkeeping
Current configuration
- Triple-channel DDR3 1,333 MHz
- 32 GB/s per socket (theoretical) and 17.9 GB/s (empirical)
On a NUMA system
- More nodes
- Careful placement...
Future Work
Consider other architectures
- AMD's APU
- Tilera's tiled many-core chips
- Intel's MIC
Become a platform for all new FIA architectures
- Advantage over NetFPGA, ServerSwitch, and ATCA solutions
46
For more details https://shader.kaist.edu/ QUESTIONS? THANK YOU! 47