PacketShader as a Future Internet Platform

Similar documents
PacketShader: A GPU-Accelerated Software Router

GPGPU introduction and network applications. PacketShaders, SSLShader

The Power of Batching in the Click Modular Router

Recent Advances in Software Router Technologies

소프트웨어기반고성능침입탐지시스템설계및구현

Computer Networks CS 552

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

GASPP: A GPU- Accelerated Stateful Packet Processing Framework

OpenFlow Software Switch & Intel DPDK. performance analysis

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

G-NET: Effective GPU Sharing In NFV Systems

Reliably Scalable Name Prefix Lookup! Haowei Yuan and Patrick Crowley! Washington University in St. Louis!! ANCS 2015! 5/8/2015!

Forwarding Architecture

Intel Workstation Technology

NetSlices: Scalable Mul/- Core Packet Processing in User- Space

RouteBricks: Exploiting Parallelism To Scale Software Routers

A 400Gbps Multi-Core Network Processor

100 Gbps Open-Source Software Router? It's Here. Jim Thompson, CTO, Netgate

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Advanced Computer Networks. End Host Optimization

DPDK Summit China 2017

NBA (Network Balancing Act): A High-performance Packet Processing Framework for Heterogeneous Processors

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Cisco Ultra Packet Core High Performance AND Features. Aeneas Dodd-Noble, Principal Engineer Daniel Walton, Director of Engineering October 18, 2018

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

The rcuda middleware and applications

Next Generation Enterprise Solutions from ARM

Lecture 1: Gentle Introduction to GPUs

Toward a unified architecture for LAN/WAN/WLAN/SAN switches and routers

IP Forwarding. CSU CS557, Spring 2018 Instructor: Lorenzo De Carli

MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces

LegUp: Accelerating Memcached on Cloud FPGAs

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

P51: High Performance Networking

The Power of Batching in the Click Modular Router

40 Gbps IPsec on Commodity Hardware. Jim Thompson Founder & CTO Netgate

Accelerating 4G Network Performance

Much Faster Networking

Today s Paper. Routers forward packets. Networks and routers. EECS 262a Advanced Topics in Computer Systems Lecture 18

Improving Packet Processing Performance of a Memory- Bounded Application

Cisco ASR 1000 Series Routers Embedded Services Processors

Today s Paper. Routers forward packets. Networks and routers. EECS 262a Advanced Topics in Computer Systems Lecture 18

PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES

Maximum Performance. How to get it and how to avoid pitfalls. Christoph Lameter, PhD

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

It's kind of fun to do the impossible with DPDK Yoshihiro Nakajima, Hirokazu Takahashi, Kunihiro Ishiguro, Koji Yamazaki NTT Labs

Evolution of the netmap architecture

15-744: Computer Networking. Middleboxes and NFV

VALE: a switched ethernet for virtual machines

Re-architecting Virtualization in Heterogeneous Multicore Systems

Arrakis: The Operating System is the Control Plane

Anand Raghunathan

Cisco ASR 1000 Series Aggregation Services Routers: QoS Architecture and Solutions

Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances

QuickSpecs. HP Z 10GbE Dual Port Module. Models

Cisco Nexus 9508 Switch Power and Performance

P4GPU: A Study of Mapping a P4 Program onto GPU Target

CME 213 S PRING Eric Darve

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

ECE 571 Advanced Microprocessor-Based Design Lecture 20

NUMA replicated pagecache for Linux

Decision Forest: A Scalable Architecture for Flexible Flow Matching on FPGA

NetFPGA Update at GEC4

Evaluation of the Chelsio T580-CR iscsi Offload adapter

Addressing the Memory Wall

CSE 123A Computer Networks

Topic & Scope. Content: The course gives

A Network-centric TCP for Interactive Video Delivery Networks (VDN)

Advanced Parallel Programming I

Robert Jamieson. Robs Techie PP Everything in this presentation is at your own risk!

CellSDN: Software-Defined Cellular Core networks

Revisiting router architectures with Zipf

Achieve Optimal Network Throughput on the Cisco UCS S3260 Storage Server

CS427 Multicore Architecture and Parallel Computing

Multimedia in Mobile Phones. Architectures and Trends Lund

PC BUILDING PRESENTED BY

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications

Improving DPDK Performance

DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture

Pactron FPGA Accelerated Computing Solutions

Efficient and Scalable Shading for Many Lights

Big Data Systems on Future Hardware. Bingsheng He NUS Computing

46PaQ. Dimitris Miras, Saleem Bhatti, Peter Kirstein Networks Research Group Computer Science UCL. 46PaQ AHM 2005 UKLIGHT Workshop, 19 Sep

Networking at the Speed of Light

Evaluating the Suitability of Server Network Cards for Software Routers

A Case Study in Optimizing GNU Radio s ATSC Flowgraph

Controlling Parallelism in a Multicore Software Router

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources

100 GBE AND BEYOND. Diagram courtesy of the CFP MSA Brocade Communications Systems, Inc. v /11/21

Dell PowerEdge 11 th Generation Servers: R810, R910, and M910 Memory Guidance

Gen-Z Memory-Driven Computing

Getting Real Performance from a Virtualized CCAP

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

10GE network tests with UDP. Janusz Szuba European XFEL

Silicon Motion s Graphics Display SoCs

Software and Tools for HPE s The Machine Project

SAP High-Performance Analytic Appliance on the Cisco Unified Computing System

Themes. The Network 1. Energy in the DC: ~15% network? Energy by Technology

Distributed Systems. 01. Introduction. Paul Krzyzanowski. Rutgers University. Fall 2013

Transcription:

PacketShader as a Future Internet Platform AsiaFI Summer School 2011.8.11. Sue Moon in collaboration with: Joongi Kim, Seonggu Huh, Sangjin Han, Keon Jang, KyoungSoo Park Advanced Networking Lab, CS, KAIST Networked and Distributed Computing Systems Lab, EE, KAIST

Network Systems Routers are the last bastion of niche super computers Cisco CRS-1 4-slot 320Gbps chassis = $15K on ebay Cisco s CRS-3 Multishelf System = 322Tbps =$15K x 10^3 + a Cost = special-purpose ASIC chips + TCAMs + S/W suites Why? People want Huge Freaking Routers! Why do people go for large routers? Managment O/H Routing protocol scalability Hurdles for innovation Special-purpose ASICs, network processors Proprietary S/W (Cisco s IOS, Juniper s JunOS, Huawei s???) Cost 2

Recent Trends New functions keep added Firewalls / IDS / IPS Monitoring and measurement VPN gears VoIP Traffic engineering Switches with minimum L3 functions at the network edge 3

Next Generation Networking vs Future Internet Evolution vs revolution Today s problems w/ hacks and patches vs Architectural re-design to attack the problems at the roots Both approaches need to address ever-increasing traffic demand 2010 2020 1.8B Internet users 5B Internet users 4.6B mobile subscribers 10B mobile subscribers 1.4B connected machines 40B connected machines 800 x (10^18) bytes of info 53 x (10^21) bytes of info Notable breakthrough A new protocol that could replace IP 4

Challenges Ahead Design and demonstrate overarching architecture Routing + security + mobility NSF Future Internet Architecture (FIA) projects XIA NDN Mobility First Nebula NSF GENI projects Testbed building Mostly OpenFlow driven Need a platform to play with Fast but cost-effective Software routers! 5

Positioning of 10+ Gbps Routers Commercial competitor? Core routers with 100+ Tbps capacity? No. Edge routers with 100+ Gbps capacity with complex features? Maybe Experimental platforms? NetFGPA ServerSwitch RouteBricks ATCA-based boxes 6

PacketShader: A GPU-Accelerated Software Router High-performance Our prototype: 40 Gbps on a single box 7

Software Router Despite its name, not limited to IP routing You can implement whatever you want on it. Driven by software Flexible Friendly development environments Based on commodity hardware Cheap Fast evolution 8

Now 10 Gigabit NIC is a commodity From $200 $300 per port Great opportunity for software routers 9

Achilles Heel of Software Routers Low performance Due to CPU bottleneck Year Ref. H/W IPv4 Throughput 2008 Egi et al. Two quad-core CPUs 3.5 Gbps 2008 Enhanced SR Bolla et al. 2009 RouteBricks Dobrescu et al. Two quad-core CPUs Two quad-core CPUs (2.8 GHz) 4.2 Gbps 8.7 Gbps Not capable of supporting even a single 10G port 10

What is PacketShader? A high-speed software router Presented at SIGCOMM 2010 40 Gbps on a single box Main ideas Optimizing packet I/O Utilizing GPU for packet processing Four applications IPv4, IPv6 forwarding, OpenFlow switch, IPsec gateway 11

WHAT IS GPU? 12

GPU = Graphics Processing Unit The heart of graphics cards Mainly used for real-time 3D game rendering Massively-parallel processing capacity (Ubisoft s AVARTAR, from http://ubi.com) 13

CPU vs. GPU CPU: Small # of super-fast cores GPU: Large # of small cores 14

Silicon Budget in CPU and GPU ALU Xeon X5550: 4 cores 731M transistors GTX480: 480 cores 3,200M transistors 15

CPU BOTTLENECK 16

Per-Packet CPU Cycles for 10G IPv4 + 1,200 600 = 1,800 cycles Cycles needed IPv6 Packet I/O IPv4 lookup + 1,200 1,600 Packet I/O IPv6 lookup = 2,800 IPsec + 1,200 5,400 = 6,600 Packet I/O Encryption and hashing Your budget 1,400 cycles 10G, min-sized packets, dual quad-core 2.66GHz CPUs (in x86, cycle numbers are from RouteBricks [Dobrescu09] and ours) 17

PacketShader Part 1: I/O Optimization 1,200 + 600 = 1,800 cycles Packet I/O IPv4 lookup 1,200 + 1,600 = 2,800 Packet I/O IPv6 lookup 1,200 Packet I/O + 5,400 = 6,600 Encryption and hashing Packet I/O 1,200 reduced to 200 cycles per packet Main ideas Huge packet buffer Batch processing 18

PacketShader Part 2: GPU Offloading + 600 Packet I/O Packet I/O Packet I/O + + IPv4 lookup 1,600 IPv6 lookup 5,400 Encryption and hashing GPU Offloading for Memory-intensive or Compute-intensive operations 19

OPTIMIZING PACKET I/O ENGINE 20

User-space Packet Processing Packet processing in kernel is bad Processing in user-space is good Kernel has higher scheduling priority; overloaded kernel may starve userlevel processes. Some CPU extensions such as MMX and SSE are not available. Buggy kernel code causes irreversible damage to the system. Rich, friendly development and debugging environment Seamless integration with 3 rd party libraries such as CUDA or OpenSSL Easy to develop virtualized data plane. But packet processing in user-space is known to be 3x times slower! Our solution: (1) batching + (2) better core-queue mapping 21

Inefficiencies of Linux Network Stack Compact metadata Batch processing Huge packet buffer Software prefetch CPU cycle breakdown in packet RX

Huge Packet Buffer eliminates per-packet buffer allocation cost Linux per-packet buffer allocation Our huge packet buffer scheme 23

GPU FOR PACKET PROCESSING 24

Advantages of GPU for Packet Processing 1. Raw computation power 2. Memory access latency 3. Memory bandwidth Comparison between Intel X5550 CPU NVIDIA GTX480 GPU 25

(1/3) Raw Computation Power Compute-intensive operations in software routers Hashing, encryption, pattern matching, network coding, compression, etc. GPU can help! Instructions/sec CPU: 43 10 9 = 2.66 (GHz) 4 (# of cores) 4 (4-way superscalar) < GPU: 672 10 9 = 1.4 (GHz) 480 (# of cores) 26

(2/3) Memory Access Latency Software router lots of cache misses GPU can effectively hide memory latency Cache miss Cache miss GPU core Switch to Thread 2 Switch to Thread 3 27

(3/3) Memory Bandwidth CPU s memory bandwidth (theoretical): 32 GB/s 28

(3/3) Memory Bandwidth 2. RX: RAM CPU 3. TX: CPU RAM 1. RX: NIC RAM 4. TX: RAM NIC CPU s memory bandwidth (empirical) < 25 GB/s 29

(3/3) Memory Bandwidth Your budget for packet processing can be less 10 GB/s 30

(3/3) Memory Bandwidth Your budget for packet processing can be less 10 GB/s GPU s memory bandwidth: 174GB/s 31

PACKETSHADER DESIGN 32

Scaling with Multiple Multi-Core CPUs Shader Device driver Preshader Postshader Device driver Shader 33

Throughput (Gbps) Results (w/ 64B packets) CPU-only CPU+GPU 40 35 30 25 20 15 10 5 0 39.2 38.2 32 28.2 15.6 8 10.2 3 IPv4 IPv6 OpenFlow IPsec GPU speedup 1.4x 4.8x 2.1x 3.5x 34

Example 1: IPv6 forwarding Longest prefix matching on 128-bit IPv6 addresses Algorithm: binary search on hash tables [Waldvogel97] 7 hashings + 7 memory accesses Prefix length 1 64 80 96 128 35

Throughput (Gbps) Example 1: IPv6 forwarding 45 40 35 30 25 20 15 10 5 0 CPU-only CPU+GPU 64 128 256 512 1024 1514 Packet size (bytes) (Routing table was randomly generated with 200K entries) Bounded by motherboard IO capacity 36

Results Year Ref. H/W IPv4 Throughput 2008 Egi et al. Two quad-core CPUs 3.5 Gbps 2008 Enhanced SR Bolla et al. Two quad-core CPUs 4.2 Gbps Kernel 2009 RouteBricks Dobrescu et al. Two quad-core CPUs (2.8 GHz) 8.7 Gbps 2010 PacketShader (CPU-only) 2010 PacketShader (CPU+GPU) Two quad-core CPUs (2.66 GHz) Two quad-core CPUs + two GPUs 28.2 Gbps 39.2 Gbps User 37

PacketShader 1.0 GPU a great opportunity for fast packet processing PacketShader Optimized packet I/O + GPU acceleration scalable with # of multi-core CPUs, GPUs, and high-speed NICs Current Prototype Supports IPv4, IPv6, OpenFlow, and IPsec 40 Gbps performance on a single PC 38

Next Steps? 39

PacketShader 2.0 Control plane integration Dynamic routing protocols with Quagga or XORP Opportunistic offloading CPU at low load GPU at high load Multi-functional, modular programming environment Integration with Click? [Kohler99] 40

#3 Multi-functional, modular programming environment Cascading NIC NIC NIC Recv? Y N? N Y? Chunk Subchunk Filter Gathering queue Schedulable task? N Module 1 Y Module 2 Module N Send NIC NIC NIC (drop) slow path (e.g. Linux TCP/IP stack)

Factors behind PacketShader 1.0 Performance Batching, batching, batching! The IO engine (modified NIC driver) uses continuous huge packet buffers called chunks. The user-level process pipelines multiple chunks. The GPU processes multiple chunks in parallel. Hardware-aware optimizations No NUMA node crossing Minimized cache conflicts among multi-cores

Remaining Challenges 100+ Gbps speed Stateful processing Intrusion detection systems / firewalls 43

Review of I/O Capacity for 100+ Gbps QuickPath Interconnect (QPI) CPU socket-to-socket link for remote memory IOH-to-IOH link for I/O traffic CPU-to-IOH for CPU to peripheral connections Today s QPI link 12.8 GB/s or 102.4 Gbps Sandy Bridge On recall at the moment Expected to boost performance to 60 Gbps w/o modification

Review of Memory B/W for 100+ Gbps For 100Gbps forwarding we need 400 Gbps in memory bandwidth + bookkeeping Current configuration triple-channel DDR3 1,333 MHz 32 GB/s per core (theoretical) and 17.9GB/s (empirical) On NUMA system More nodes Careful placement...

Future Work Consider other architectures AMD s APU Tilera s tiled many-core chips Inte s MIC Become a platform for all new FIA architectures Advantage over NetFPGA, ServerSwitch, ATCA solutions 46

For more details https://shader.kaist.edu/ QUESTIONS? THANK YOU! 47