P4GPU: A Study of Mapping a P4 Program onto GPU Target

Peilong Li, Tyler Alterio, Swaroop Thool and Yan Luo
ACANETS Lab (http://acanets.uml.edu/)
University of Massachusetts Lowell
11/18/15

Background & Motivation

- P4 enables programmability of the network data plane.
- Software-based routing and forwarding: programmability at line rate.
- P4 hardware targets: CPU, NPU, reconfigurable switches, iNICs, FPGA, etc.
- GPGPU potential? Hundreds to thousands of cores, SIMD architecture, high memory bandwidth, and a mature parallel programming environment.

Contributions

- P4-to-GPU: provide a viable way to map a portion of a P4 program onto a GPU target.
- Acceleration: design a heterogeneous CPU/GPU framework to accelerate packet-processing performance.
- Optimization: optimize lookup and classification kernels on the GPU and explore load balancing across the CPU/GPU framework.

P4 to GPU: Overview

Mapping P4 to a GPU: difficulties and hints.

Difficulties:
- No dedicated match+action hardware on the GPU.
- Inefficient kernel design leads to a performance disaster.
- The coprocessor has limited memory and a low clock frequency.
- GPUs are good at highly parallel work (table lookup, matching) but bad at dependencies and branching (parsing, state machines).

Hints:
- Take advantage of the many cores.
- Kernels require optimization; an auto-generated kernel may not perform well.
- A heterogeneous CPU/GPU architecture is desired.
- Offload only parallelizable code to the GPU.

P4 to GPU: Software Stack

Only part of the P4 program is mapped onto the GPU; the rest remains on the CPU.

P4 to GPU: Design

Three steps of mapping:
- Step 1: P4 intermediate representation (HLIR) preparation.
- Step 2: GPU kernel configuration generation.
- Step 3: GPU kernel initialization.

Flow: P4 program -> (Step 1) HLIR P4 IR -> (Step 2) Python IR parser producing the LPM lookup configuration and the classifier configuration -> (Step 3) modularized GPU kernels: LPM lookup engine and classifier engine.
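Step 2 above can be sketched as a small routine that walks the table records produced in Step 1 and emits one kernel configuration per table. This is an illustrative Python sketch only: the dict stands in for HLIR's table objects, and the field names are assumptions, not the tool's actual parser.

```python
# Illustrative sketch of Step 2: turn P4 IR table records into
# per-kernel configurations. The dict below stands in for HLIR's
# table objects; record field names are assumptions.

def make_kernel_config(p4_tables):
    """Map each P4 table record to a GPU kernel configuration."""
    configs = []
    for name, tbl in p4_tables.items():
        configs.append({
            # LPM tables go to the LPM lookup engine; everything
            # else goes to the rule-based classifier engine.
            "kernel": "lpm" if tbl["match_type"] == "lpm" else "classifier",
            "table": name,
            "field": tbl["field"],
            "size": tbl["size"],
        })
    return configs

# Hypothetical IR as extracted in Step 1 (insertion order models
# the sequential table order from the control flow).
ir = {
    "ipv4_lpm": {"match_type": "lpm", "field": "ipv4.dstAddr", "size": 1024},
    "forward":  {"match_type": "exact", "field": "nhop_ipv4", "size": 512},
}
```

Calling `make_kernel_config(ir)` yields one LPM and one classifier configuration in control-flow order (tables are assumed sequential, as on the code-diagram slide).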

P4 to GPU: Code Diagram

P4 program:

    table ipv4_lpm {
        reads { ipv4.dstAddr: lpm; }
        actions { set_nhop; _drop; }
        size: 1024;
    }

Step 1 (HLIR) yields the P4 IR, e.g. h.p4_tables (ipv4_lpm: match lpm, size 1024; forward: match exact; ...) and h.p4_actions.

Step 2 emits kernel configurations:

    LPM Kernel {
        size: 1024
        field: dstAddr
        match: set_nhop
        not_match: _drop
        ...
    }

    Classifier Kernel {
        size: 512
        field: nhop_ipv4
        match_type: exact
        match: set_mac
        not_match: _drop
        ...
    }

Step 3 initializes the kernels:

    __device__ lpm_lookup (...) {...}
    __device__ classifier (...) {...}

The GPU kernel configuration parser:
- Parses h.p4_tables etc. in the IR to obtain the P4 table configuration.
- Finds the order of tables from the control flow (sequential tables for now).

P4 to GPU: GPU Kernel Design

LPM lookup kernel:
- Used for the ipv4_lpm table; designed for both IPv4 and IPv6.
- Baseline kernel: linear search.
- Optimized kernel: binary trie and k-stride multibit trie.

Rule-based classifier kernel:
- Used for the forward and send_frame tables; designed for both exact and wildcard match.
- Baseline kernel: linear search.
- Optimized kernel: grid-of-tries.

(Figures: binary trie, multibit trie, grid-of-tries.)
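The optimized lookup above can be illustrated on the host side with a minimal binary-trie LPM in Python. It shows the logic the GPU kernel parallelizes across packets, not the CUDA implementation itself; the prefixes and next hops in the usage below are made up.

```python
# Minimal binary-trie LPM sketch (one address bit per level). The
# GPU kernel runs this per packet in parallel; this host-side Python
# version only illustrates the algorithm.

class TrieNode:
    def __init__(self):
        self.child = [None, None]   # 0-branch, 1-branch
        self.nhop = None            # next hop if a prefix ends here

def insert(root, prefix, plen, nhop):
    """Insert an IPv4 prefix (32-bit int) of length plen bits."""
    node = root
    for i in range(plen):
        bit = (prefix >> (31 - i)) & 1
        if node.child[bit] is None:
            node.child[bit] = TrieNode()
        node = node.child[bit]
    node.nhop = nhop

def lpm_lookup(root, addr):
    """Return the next hop of the longest matching prefix, or None."""
    node, best = root, None
    for i in range(32):
        if node.nhop is not None:
            best = node.nhop        # remember deepest match so far
        node = node.child[(addr >> (31 - i)) & 1]
        if node is None:
            return best
    return node.nhop if node.nhop is not None else best
```

For example, after `insert(root, 0x0A000000, 8, "A")` (10.0.0.0/8) and `insert(root, 0x0A010000, 16, "B")` (10.1.0.0/16), `lpm_lookup(root, 0x0A010203)` returns "B": the /16 wins over the /8.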

P4 to GPU: GPU Kernel Design (cont.)

Latency hiding:
- Batch processing: reduces the number of data copies.
- 2D pipelining: two CUDA streams overlap host-to-device copies (H2D), kernel execution, and device-to-host copies (D2H).

Memory tweaking:
- Use texture memory for storing lookup tables.
- Use memory mapped from host to device to reduce data-movement overhead.

Data structures:
- Hash tables and tries are not GPU-favored data structures.
- Trie-to-array vectorization.
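The trie-to-array vectorization above can be illustrated as follows: a k-stride multibit trie is flattened into one integer array so that each lookup step is a plain indexed load (a texture-memory-friendly access pattern) instead of a pointer dereference. The encoding here (child base offsets, -1 for no child, 8-bit addresses) is an assumption for illustration, not the paper's exact layout.

```python
# Illustrative trie-to-array flattening for a 2-bit-stride multibit
# trie over 8-bit addresses. Each node occupies FANOUT consecutive
# array slots, so a lookup step is a plain indexed load.

STRIDE = 2                 # consume 2 address bits per step
FANOUT = 1 << STRIDE       # 4 slots per node

def lookup_flat(children, nhops, addr, bits=8):
    """Walk the flattened trie for an address of `bits` bits.

    children[slot] is a child node's base offset, or -1 for none;
    nhops[slot] is the next hop stored at that slot, if any."""
    node, best = 0, None
    for shift in range(bits - STRIDE, -1, -STRIDE):
        slot = node + ((addr >> shift) & (FANOUT - 1))
        if nhops[slot] is not None:
            best = nhops[slot]          # longest match so far
        if children[slot] < 0:
            break
        node = children[slot]
    return best

# Two flattened nodes of FANOUT slots each: prefix 10* -> "A" at the
# root, prefix 1011* -> "B" in the child node at base offset 4.
children = [-1, -1, 4, -1,   -1, -1, -1, -1]
nhops    = [None, None, "A", None,   None, None, None, "B"]
```

With these tables, `lookup_flat(children, nhops, 0b10110011)` descends root slot 2 (match "A"), then child slot 7 (match "B"), returning the longer match "B".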

Architectural Design

Components: CPU, main memory, GPU (texture memory for tables, global memory for packets).

- Step 1: receiving input packets.
- Step 2: batching into packet buffers in memory.
- Step 3: load-balancing and offloading (via the load balancer and mapped memory).
- Step 4: buffering results in result buffers.
- Step 5: applying forwarding actions to output packets.
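The Step-3 load balancer can be sketched as a simple batch splitter that sends a tunable share of each packet batch down the GPU path and the remainder to the CPU. The policy and the default ratio are assumptions for illustration; the slides do not specify the balancing algorithm.

```python
# Hedged sketch of the Step-3 load balancer: split each packet batch
# between the GPU and CPU paths by a tunable share. The default
# ratio is an illustrative assumption.

def split_batch(batch, gpu_share=0.75):
    """Return (gpu_part, cpu_part) of a packet batch."""
    cut = int(len(batch) * gpu_share)
    return batch[:cut], batch[cut:]
```

For example, `split_batch(list(range(8)))` offloads the first six packets to the GPU and leaves two for the CPU; tuning `gpu_share` at runtime is the knob the framework's load-balancing exploration turns.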

Evaluation: Setup

Experiment platform:
- CPU: Intel quad-core i7-3610QM @ 2.3 GHz
- Main memory: 8 GB
- GPU: Nvidia GT 650M with 384 cores

Datasets:
- RouteViews (Jan 2015): IPv4 (550,000 entries), IPv6 (20,000 entries)
- ClassBench: a set of filters, i.e. FW and ACL rules

Traffic generation:
- Random generator: assumes no packet I/O overhead.
- Click Modular Router: emulated packet sending/receiving (socket-based).

Evaluation: Results

(Figures: lookup kernel and classifier kernel performance.)

Evaluation: Results

(Figure: throughput and latency.)

Conclusion & Future Work

- The P4GPU tool provides a viable way to map P4 code onto a GPU target.
- The GPU is a promising target for P4 to achieve high-throughput, low-latency forwarding.

Future work:
- Integrate P4 table dependencies into the P4GPU tool.
- Study the possibility of auto-generating optimized GPU kernels.
- Implement GPU kernels for more functionality, e.g. OpenFlow.

That's All