GPGPU introduction and network applications: PacketShader, SSLShader

1 GPGPU introduction and network applications: PacketShader, SSLShader

2 Agenda. GPGPU introduction: computer graphics background; GPGPUs past, present and future. PacketShader: a GPU-accelerated software router. SSLShader: a GPU-accelerated SSL encryption/decryption proxy. 2014 Mellanox Technologies

3 GPGPU Intro

4 GPU = Graphics Processing Unit. The heart of graphics cards. Mainly used for real-time 3D game rendering. Massively parallel processing capacity. (Image: Ubisoft's Avatar.)

5 GPU Fundamentals: The Graphics Pipeline. A simplified graphics pipeline: the application on the CPU sends graphics state and 3D vertices to the GPU. There the Transform stage produces transformed, lit 2D vertices, the Rasterizer produces fragments (pre-pixels), and the Shade stage produces final pixels (color, depth). Video memory holds textures and supports render-to-texture. Note that pipe widths vary; many caches, FIFOs, and so on are not shown.

6 GPU Pipeline: Transform. Vertex processors (multiple operate in parallel) transform vertices from world space to image space and compute per-vertex lighting.

7 GPU Pipeline: Rasterizer. The rasterizer converts the geometric representation (vertices) into an image representation (fragments). A fragment is an image fragment: a pixel plus associated data such as color, depth, and stencil. The rasterizer interpolates per-vertex quantities across pixels.

8 GPU Pipeline: Shade. Fragment processors (multiple in parallel) compute a color for each pixel, optionally reading colors from textures (images).

9 GPU Fundamentals: The Modern Graphics Pipeline. The flow is unchanged: the application on the CPU sends graphics state and 3D vertices to the GPU, which produces transformed, lit 2D vertices, then fragments (pre-pixels), then final pixels (color, depth), with video memory (textures) and render-to-texture. The difference is that the vertex processor and the fragment (pixel) processor are now programmable.

10 NVIDIA G80 GPU Architecture Overview. 16 multiprocessor (MP) blocks. Each MP block has 8 streaming processors (IEEE 754 single-precision floating-point compliant), 16 KB of shared memory, a 64 KB constant cache, and an 8 KB texture cache. Each processor can access all of the memory at 86 GB/s, but with different latencies: about 2 cycles for shared memory and about 300 cycles for device memory.

11 Queueing. FIFO (first-in, first-out) buffering is provided between task stages. It accommodates variation in execution time and provides the elasticity that allows unified load balancing to work. FIFOs can also be unified: they share a single large memory with multiple head-tail pairs, allocated as required. (Diagram: Application, Vertex assembly, FIFO, Vertex operations, FIFO, Primitive assembly, FIFO.)

12 SIMT: Memory Access Latency Hiding. The GPU can effectively hide memory latency: when a thread takes a cache miss, the GPU core switches to another thread instead of stalling.

13 Thread Processor: Implementation vs. architecture model. (Diagram: NVIDIA GeForce 8800 OpenGL pipeline. Architecture model stages: application, vertex assembly, vertex operations, primitive assembly, primitive operations, rasterization, fragment operations, framebuffer. Implementation blocks: data assembler, setup / raster / ZCull, vertex, primitive, and fragment thread issue units, 16 SP units, 8 TF units with 8 L1 caches, 6 L2 caches, and 6 FB partitions. Source: NVIDIA.)

14 Thread Processor: Correspondence (by color). (Diagram: the same NVIDIA GeForce 8800 OpenGL pipeline, color-coded to show which blocks are fixed-function assembly processors, which form the application-programmable parallel processor, and which perform fixed-function framebuffer operations. Rasterization is labeled as fragment assembly.)

15 The NVIDIA G80 GPU. 128 streaming floating-point processors. 1.5 GB of shared RAM with 86 GB/s bandwidth. 500 GFLOPS on one chip (single precision).

16 Why are GPUs so fast? The entertainment industry has driven the economics of these chips: males age [...] buy $10B in video games per year. Moore's Law++. Simplified design (stream processing). Huge parallelism maps well to hardware. Latency hiding using the parallelism. Single-chip designs.

17 Silicon Budget in CPU and GPU (figure: die area devoted to ALUs). Xeon X5550: 4 cores, 731M transistors. GTX480: 480 cores, 3,200M transistors.

18 Floorplan comparison: CPU (Core i7) vs. GPU (NVIDIA Kepler).

19 GPU: A Specialized Processor. Very efficient for: fast parallel floating-point processing; single instruction, multiple data operations; high computation per memory access. Not as efficient for: double precision (the situation is improving); logical operations on integer data; branching-intensive operations; random-access, memory-intensive operations.

20 GPGPU! A programmable stream processor with a huge number of ALUs and huge memory bandwidth. Programming was painful: the OpenGL Shading Language (GLSL) required a deep understanding of computer graphics, although it gave huge application speedups when done correctly. CUDA/OpenCL: C-like code, massively multi-threaded, simple to port existing code to (but not to get good performance from).

21 CUDA: Single Instruction, Multiple Threads

22 Achieving Performance in CUDA. Almost all C code will compile as CUDA code, but it will run slower: single-threaded operation is ~50x slower than CPU code. You must expose parallelism and be careful with memory accesses. Thread scheduling helps hide memory access latency, but even this runs out. Performance is a moving target: optimizations are strongly hardware- and software-platform dependent, and can make a huge difference (100x and even more).

23 PacketShader: A GPU-Accelerated Software Router

24 High Performance Software Router. Work by Sangjin Han, Keon Jang, KyoungSoo Park and Sue Moon (Advanced Networking Lab, CS, KAIST; Networked and Distributed Computing Systems Lab, EE, KAIST). 40 Gbps packet forwarding in a single box: IPv4, 64B packets; bigger packet sizes are bounded by PCIe bandwidth. 20 Gbps IPsec tunneling for 1024B packets; 10 Gbps for 64B packets.

25 Software Router. Despite its name, it is not limited to IP routing: you can implement whatever you want on it. Driven by software: flexible, with friendly development environments. Based on commodity hardware: cheap, with fast evolution.

26 The 10 Gigabit NIC is now a commodity, at $200 to $300 per port. This is a great opportunity for software routers.

27 Achilles' Heel of Software Routers. Low performance, due to the CPU bottleneck.
Year; Ref.; H/W; IPv4 throughput
2008; Egi et al.; two quad-core CPUs; 3.5 Gbps
2008; "Enhanced SR", Bolla et al.; two quad-core CPUs; 4.2 Gbps
2009; "RouteBricks", Dobrescu et al.; two quad-core CPUs (2.8 GHz); 8.7 Gbps
None of these is capable of supporting even a single 10G port.

28 Per-Packet CPU Cycles for 10G. Cycles needed:
IPv4: packet I/O 1,200 + IPv4 lookup 600 = 1,800 cycles
IPv6: packet I/O 1,200 + IPv6 lookup 1,600 = 2,800 cycles
IPsec: packet I/O 1,200 + encryption and hashing 5,400 = 6,600 cycles
Your budget: 1,400 cycles (10G, min-sized packets, dual quad-core 2.66 GHz CPUs). Cycle numbers are for x86, from RouteBricks [Dobrescu09] and PacketShader.
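The 1,400-cycle budget can be reproduced with a short calculation. The sketch below is an illustration, not part of the original slides: it assumes that a minimum-sized 64-byte Ethernet frame occupies 84 bytes on the wire (8 bytes of preamble plus a 12-byte inter-frame gap), and the function name `cycles_per_packet` is hypothetical.

```python
def cycles_per_packet(link_bps=10e9, frame_bytes=64, overhead_bytes=20,
                      cores=8, clock_hz=2.66e9):
    """CPU cycles available per packet when the link runs at line rate.

    overhead_bytes is an assumption: 8 bytes of preamble/SFD plus a
    12-byte inter-frame gap for each Ethernet frame.
    """
    packets_per_sec = link_bps / ((frame_bytes + overhead_bytes) * 8)
    return cores * clock_hz / packets_per_sec


# Per-packet costs quoted on the slide (cycles).
COSTS = {"IPv4": 1800, "IPv6": 2800, "IPsec": 6600}

if __name__ == "__main__":
    budget = cycles_per_packet()
    print(f"budget: {budget:.0f} cycles per packet")
    for name, cost in COSTS.items():
        print(f"{name}: {cost} cycles, {cost / budget:.1f}x the budget")
```

Under these assumptions the budget comes out slightly above 1,400 cycles, so all three workloads exceed it on CPUs alone.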

29 PacketShader Approach 1: I/O Optimization. Packet I/O accounts for 1,200 cycles in each of the three workloads (IPv4 = 1,800 cycles, IPv6 = 2,800, IPsec = 6,600). PacketShader reduces packet I/O from 1,200 to 200 cycles per packet. Main ideas: a huge packet buffer and batch processing. Allocating SKBs accounts for 50% of CPU time.

30 PacketShader Approach 2: GPU Offloading. Besides packet I/O, the per-packet work is IPv4 lookup, IPv6 lookup (1,600 cycles), and encryption and hashing (5,400 cycles). GPU offloading targets these memory-intensive or compute-intensive operations. This is the main topic of this talk.

31 GPU FOR PACKET PROCESSING

32 Advantages of GPU for Packet Processing. 1. Raw computation power. 2. Memory access latency. 3. Memory bandwidth. The comparison is between an Intel X5550 CPU and an NVIDIA GTX480 GPU.

33 (1/3) Raw Computation Power. Compute-intensive operations in software routers: hashing, encryption, pattern matching, network coding, compression, etc. GPU can help! Instructions/sec: CPU ≈ 43 × 10^9 = 2.66 (GHz) × 4 (# of cores) × 4 (4-way superscalar) < GPU = 672 × 10^9 = 1.4 (GHz) × 480 (# of cores).
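As a quick check of the comparison above, the following sketch multiplies out the two peak instruction rates. It uses only the clock rates, core counts, and issue width quoted on the slide; the helper name is hypothetical, and peak rates like these ignore memory stalls and divergence.

```python
def peak_instructions_per_sec(clock_hz, cores, issue_width=1):
    """Peak instruction rate: clock x cores x instructions issued per cycle."""
    return clock_hz * cores * issue_width


# Figures from the slide: Xeon X5550 (4 cores, 4-way superscalar) and GTX480.
cpu = peak_instructions_per_sec(2.66e9, cores=4, issue_width=4)
gpu = peak_instructions_per_sec(1.4e9, cores=480)

if __name__ == "__main__":
    print(f"CPU: {cpu / 1e9:.1f} G instr/s")
    print(f"GPU: {gpu / 1e9:.1f} G instr/s")
    print(f"GPU/CPU ratio: {gpu / cpu:.1f}x")
```

The ratio works out to roughly 16x in favor of the GPU.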

34 (2/3) Memory Access Latency. A software router incurs lots of cache misses. The GPU can effectively hide memory latency: on a cache miss, the GPU core switches to another thread.

35 (3/3) Memory Bandwidth. CPU's memory bandwidth (theoretical): 32 GB/s.

36 (3/3) Memory Bandwidth. Each packet crosses memory four times: 1. RX: NIC to RAM. 2. RX: RAM to CPU. 3. TX: CPU to RAM. 4. TX: RAM to NIC. CPU's memory bandwidth (empirical): < 25 GB/s.

37 (3/3) Memory Bandwidth. Your budget for packet processing can be less than 10 GB/s.

38 (3/3) Memory Bandwidth. Your budget for packet processing can be less than 10 GB/s. GPU's memory bandwidth: 174 GB/s.

39 HOW TO USE GPU

40 Basic Idea: offload core operations to the GPU (e.g., forwarding table lookup).

41 Recap: for a GPU, more parallelism means more throughput. GTX480: 480 cores.

42 Parallelism in Packet Processing. The key insight: stateless packet processing = parallelizable. Packets from the RX queue go through 1. batching, then 2. parallel processing in the GPU.

43 Batching: Long Latency? A fast link delivers enough packets in a small time window: a 10 GbE link carries up to 1,000 packets in only 67 μs, and 40 or 100 GbE needs much less time.
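The 67 μs figure follows from the wire time of minimum-sized frames. The sketch below is illustrative only; it assumes 64-byte frames plus 20 bytes of per-frame Ethernet overhead (preamble and inter-frame gap), and `batch_window_us` is a hypothetical helper name.

```python
def batch_window_us(packets, link_bps, frame_bytes=64, overhead_bytes=20):
    """Microseconds a link needs to deliver `packets` back-to-back frames.

    overhead_bytes is an assumption: preamble/SFD (8) + inter-frame gap (12).
    """
    bits = packets * (frame_bytes + overhead_bytes) * 8
    return bits / link_bps * 1e6


if __name__ == "__main__":
    for gbps in (10, 40, 100):
        window = batch_window_us(1000, gbps * 1e9)
        print(f"{gbps} GbE: 1,000 packets in {window:.1f} us")
```

So waiting to fill a batch of 1,000 packets adds tens of microseconds at most on a saturated 10 GbE link, and proportionally less on faster links.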

44 PACKETSHADER DESIGN

45 Basic Design. Three stages in a streamline: Preshader, Shader, Postshader.

46 Packet's Journey (1/3), IPv4 forwarding example. Preshader: checksum, TTL, and format checks; destination IP addresses are collected for the shader. Some packets go to the slow path.

47 Packet's Journey (2/3), IPv4 forwarding example. Shader: 1. IP addresses come in from the preshader. 2. Forwarding table lookup. 3. Next hops go out to the postshader.

48 Packet's Journey (3/3), IPv4 forwarding example. Postshader: update packets and transmit.
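The three stages of the IPv4 journey can be sketched in plain Python. This is a CPU-only illustration under stated assumptions, not PacketShader's implementation: the packet dictionaries, the `TABLE` contents, and the function names are hypothetical, the shader loop stands in for what would be a GPU kernel processing the whole batch in parallel, and every packet that fails a check is simply set aside as "slow path".

```python
import ipaddress

# Hypothetical forwarding table: prefix -> output port.
TABLE = {
    ipaddress.ip_network("10.0.0.0/8"): 1,
    ipaddress.ip_network("10.1.0.0/16"): 2,
}


def pre_shader(batch):
    """CPU side: check each packet and collect destination addresses."""
    fast, slow = [], []
    for pkt in batch:
        if pkt["ttl"] <= 1 or not pkt["checksum_ok"]:
            slow.append(pkt)  # set aside; not handled in this sketch
        else:
            fast.append(pkt)
    addrs = [int(ipaddress.ip_address(p["dst"])) for p in fast]
    return fast, addrs, slow


def shader(addrs):
    """Stand-in for the GPU kernel: one independent lookup per address."""
    hops = []
    for addr in addrs:
        ip = ipaddress.ip_address(addr)
        matches = [(net.prefixlen, port) for net, port in TABLE.items() if ip in net]
        hops.append(max(matches)[1] if matches else None)  # longest prefix wins
    return hops


def post_shader(fast, hops):
    """CPU side: update packets and queue them per output port."""
    tx = {}
    for pkt, port in zip(fast, hops):
        if port is None:
            continue  # no route; dropped in this sketch
        pkt["ttl"] -= 1
        tx.setdefault(port, []).append(pkt)
    return tx


if __name__ == "__main__":
    batch = [
        {"dst": "10.1.2.3", "ttl": 64, "checksum_ok": True},
        {"dst": "10.9.9.9", "ttl": 1, "checksum_ok": True},
    ]
    fast, addrs, slow = pre_shader(batch)
    print(post_shader(fast, shader(addrs)), "slow path:", len(slow))
```

Because each lookup in `shader` depends only on its own address, the loop body is the part that maps naturally onto one GPU thread per packet.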

49 Interfacing with NICs. Packet RX: the device driver feeds the preshader. Packet TX: the postshader hands packets back to the device driver. The shader sits between the two.

50 Scaling with a Multi-Core CPU. The master core runs the shader; the worker cores run the device driver, preshader, and postshader.

51 Scaling with Multiple Multi-Core CPUs. The same layout (shader, device driver, preshader, postshader) is replicated on each CPU, each with its own shader.

52 EVALUATION

53 Hardware Setup. CPU: total 8 CPU cores (quad-core, 2.66 GHz). NIC: total 80 Gbps (dual-port 10 GbE). GPU: total 960 cores (480 cores each, 1.4 GHz).

54 Experimental Setup. A packet generator sends input traffic (up to 80 Gbps) over 8 × 10 GbE links to PacketShader, which sends the processed packets back.

55 Results (w/ 64B packets). (Figure: throughput in Gbps, CPU-only vs. CPU+GPU.) GPU speedup: IPv4 1.4x, IPv6 4.8x, OpenFlow 2.1x, IPsec 3.5x.

56 Example 1: IPv6 forwarding. Longest prefix matching on 128-bit IPv6 addresses. Algorithm: binary search on hash tables [Waldvogel97], which takes 7 hashings + 7 memory accesses. (Figure: binary search over prefix lengths.)
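The idea behind binary search on hash tables can be sketched as follows. This is a simplified CPU-side illustration of the technique from [Waldvogel97], not PacketShader's GPU code: it keeps one hash table per prefix length, inserts markers at the shorter lengths that the search probes on the way to each real prefix, and stores on every entry the best matching real prefix so that the search never has to backtrack. The function names and the example routes are hypothetical, and default routes (prefix length 0) are not handled.

```python
import ipaddress

W = 128  # IPv6 address width in bits


def probe_lengths_before(length):
    """Shorter lengths that binary search visits on its way to `length`."""
    lo, hi, visited = 1, W, []
    while lo <= hi:
        mid = (lo + hi) // 2
        if mid == length:
            break
        if mid < length:
            visited.append(mid)  # a marker here steers the search longer
            lo = mid + 1
        else:
            hi = mid - 1
    return visited


def build(routes):
    """routes: dict of 'prefix/len' -> next hop. Returns {length: {bits: hop}}."""
    real = {}
    for cidr, hop in routes.items():
        net = ipaddress.IPv6Network(cidr)
        bits = int(net.network_address) >> (W - net.prefixlen)
        real[(net.prefixlen, bits)] = hop
    tables = {}
    for length, bits in real:
        entries = [(length, bits)]
        entries += [(m, bits >> (length - m)) for m in probe_lengths_before(length)]
        for m, key in entries:
            best = None  # longest real prefix that covers this entry
            for l in range(m, 0, -1):
                if (l, key >> (m - l)) in real:
                    best = real[(l, key >> (m - l))]
                    break
            tables.setdefault(m, {})[key] = best
    return tables


def lookup(tables, addr):
    """Longest-prefix match with one hash probe per binary-search step."""
    a = int(ipaddress.IPv6Address(addr))
    lo, hi, best = 1, W, None
    while lo <= hi:
        mid = (lo + hi) // 2
        table = tables.get(mid, {})
        key = a >> (W - mid)
        if key in table:  # real prefix or marker: try longer lengths
            if table[key] is not None:
                best = table[key]
            lo = mid + 1
        else:             # miss: only shorter lengths can match
            hi = mid - 1
    return best


if __name__ == "__main__":
    t = build({"2001:db8::/32": "A", "2001:db8:1::/48": "B"})
    print(lookup(t, "2001:db8:1::7"), lookup(t, "2001:db8:2::7"))
```

Each step of the loop is one hash computation and one table access, which is why the number of probes grows with the logarithm of the address width rather than with the number of prefix lengths.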

57 Example 1: IPv6 forwarding. (Figure: throughput in Gbps vs. packet size in bytes, CPU-only vs. CPU+GPU; throughput is bounded by motherboard I/O capacity.) The routing table was randomly generated with 200K entries.

58 Example 2: IPsec tunneling. ESP (Encapsulating Security Payload) tunnel mode with AES-CTR (encryption) and SHA1 (authentication). Start from the original IP packet (IP header, IP payload) and append an ESP trailer. 1. AES encrypts the IP header, IP payload, and ESP trailer; an ESP header is placed in front. 2. SHA1 authenticates the ESP header, IP header, IP payload, and ESP trailer. The resulting IPsec packet is: new IP header, ESP header, IP header, IP payload, ESP trailer, ESP Auth.

59 Example 2: IPsec tunneling. (Figure: throughput in Gbps and speedup vs. packet size in bytes, CPU-only vs. CPU+GPU; 3.5x speedup.)

60 Comparison with earlier software routers.
Year; Ref.; H/W; IPv4 throughput
Kernel:
2008; Egi et al.; two quad-core CPUs; 3.5 Gbps
2008; "Enhanced SR", Bolla et al.; two quad-core CPUs; 4.2 Gbps
2009; "RouteBricks", Dobrescu et al.; two quad-core CPUs (2.8 GHz); 8.7 Gbps
User:
2010; PacketShader (CPU-only); two quad-core CPUs (2.66 GHz); 28.2 Gbps
2010; PacketShader (CPU+GPU); two quad-core CPUs + two GPUs; 39.2 Gbps

61 Conclusions. The GPU is a great opportunity for fast packet processing. PacketShader: optimized packet I/O + GPU acceleration, scalable with the number of multi-core CPUs, GPUs, and high-speed NICs. The current prototype supports IPv4, IPv6, OpenFlow, and IPsec, with 40 Gbps performance on a single PC.

62 Future Work. Control plane integration: dynamic routing protocols with Quagga or Xorp. A multi-functional, modular programming environment: integration with Click? [Kohler99]. Opportunistic offloading: CPU at low load, GPU at high load. Stateful packet processing.

63 SSLShader: A GPU-Accelerated SSL encryption/decryption proxy

65 Thank You