High-Performance Packet Classification on GPU

Shijie Zhou, Shreyas G. Singapura, and Viktor K. Prasanna
Ming Hsieh Department of Electrical Engineering, University of Southern California

Outline
- Introduction
- Background
- Contributions
- Algorithm
- Evaluation
- Conclusion and Future Work

Introduction (1)
- Internet: global system of interconnected computer networks
- Exponentially increasing network traffic
- Future Internet
  - More network traffic
  - Large amounts of data
  - Changing more frequently

Introduction (2)
- Multi-field packet classification
- Applications
  - Routing
  - Access control in firewalls
  - Provision of differentiated qualities of service
  - OpenFlow flow table lookup

Multi-field Packet Classification (1)
5 fields:
- Source IP address
- Destination IP address
- Source port number
- Destination port number
- Protocol

Field  Src IP  Des IP  Src Port  Des Port  Protocol
Bits   32      32      16        16        8

Multi-field Packet Classification (2)
- Rule-set
  - A certain number of rules
  - Matching criteria for each field
  - Wildcard bit in the rule: 1, 0 or * (do not care)
- Priority
  - Multiple matches: choose the highest-priority rule, then take its action

Multi-field Packet Classification (3)

ID  Src IP            Des IP              Src Port  Des Port   Protocol  Priority  Action
1   175.77.88.155/31  119.106.158.230/32  0-65535   6888-6888  0x06      1         Act 0
2   175.77.88.6/20    36.174.239.222/32   0-65535   1604-1704  0x06      2         Act 1
3   12.2.0.0/16       192.1.1.0/24        20-30     1024-1024  0x11      3         Act 2
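
As a point of reference for the rule table above, here is a minimal sketch of 5-field matching by linear search. It is not the GPU algorithm from these slides, only a baseline that shows what "match every field, then pick the highest-priority rule" means. The names `Rule`, `matches`, and `classify` are illustrative, and the sketch assumes that a smaller priority number means higher priority, which the slides do not state.

```python
from ipaddress import ip_address, ip_network
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    rule_id: int
    src: str         # source prefix in CIDR notation
    dst: str         # destination prefix in CIDR notation
    sport: tuple     # inclusive (low, high) source-port range
    dport: tuple     # inclusive (low, high) destination-port range
    proto: int       # exact protocol number
    priority: int    # assumption: smaller number = higher priority
    action: str


# The three example rules from the slide.
RULES = [
    Rule(1, "175.77.88.155/31", "119.106.158.230/32", (0, 65535), (6888, 6888), 0x06, 1, "Act 0"),
    Rule(2, "175.77.88.6/20", "36.174.239.222/32", (0, 65535), (1604, 1704), 0x06, 2, "Act 1"),
    Rule(3, "12.2.0.0/16", "192.1.1.0/24", (20, 30), (1024, 1024), 0x11, 3, "Act 2"),
]


def matches(rule: Rule, src: str, dst: str, sport: int, dport: int, proto: int) -> bool:
    # strict=False accepts prefixes written with host bits set (e.g. 175.77.88.6/20).
    return (
        ip_address(src) in ip_network(rule.src, strict=False)
        and ip_address(dst) in ip_network(rule.dst, strict=False)
        and rule.sport[0] <= sport <= rule.sport[1]
        and rule.dport[0] <= dport <= rule.dport[1]
        and proto == rule.proto
    )


def classify(src: str, dst: str, sport: int, dport: int, proto: int) -> Optional[str]:
    """Return the action of the best matching rule, or None if nothing matches."""
    hits = [r for r in RULES if matches(r, src, dst, sport, dport, proto)]
    if not hits:
        return None
    return min(hits, key=lambda r: r.priority).action
```

A linear scan like this costs time proportional to the number of rules per packet, which is the cost that decomposition-based schemes try to avoid.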

Related Work
- Packet classification on GPU: relatively less explored
- Previous GPU implementations
  - Small number of unique rules [1]
  - Throughput or latency not discussed [2]
  - ~11 and 5 MPPS for 500 and 2000 rules [3]

[1] A. Nottingham and B. Irwin, "Parallel packet classification using GPU co-processors," in SAICSIT Conf., ACM, pp. 231-241, 2010.
[2] C. Hung, Y. Lin, K. Li, H. Wang, and S. Guo, "Efficient GPGPU-based parallel packet classification," in Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 1367-1374, 2011.
[3] K. Kang and Y. S. Deng, "Scalable packet classification via GPU metaprogramming," in Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1-4, 2011.

CUDA Programming Model
- CUDA program: Host + Kernel
- Host function runs on CPU
- Kernel function runs on GPU

GPU Architecture (figure)

Contributions
- Range-tree search and bit vector (BV) based packet classifier on GPU
- Efficient range-tree search on GPU
- Optimized data layout to minimize shared memory bank conflicts
- Throughput of 85 MPPS for a 512-rule rule-set

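Challenges
- Divergence overhead
- Limited on-chip memory: data layout
- Classic tree search: pointers to connect nodes
Figure: branch divergence within a warp over time. Threads 0 to 31 evaluate a condition (T T F T F F F F ...); threads where it is True run Action_a(), threads where it is False run Action_b().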
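
The divergence overhead in the figure can be illustrated with a deliberately simplified cost model: when the lanes of a warp disagree on a branch, the warp runs both paths one after the other, with the non-participating lanes idle. The function name and the cycle counts below are hypothetical, and real hardware costs depend on much more than this.

```python
def warp_branch_cost(predicates, cost_true, cost_false):
    """Cycles one warp spends on an if/else under a simple SIMT model.

    Each path that at least one lane takes is executed once for the whole warp.
    """
    cost = 0
    if any(predicates):        # some lane takes the True path (Action_a)
        cost += cost_true
    if not all(predicates):    # some lane takes the False path (Action_b)
        cost += cost_false
    return cost


# The T T F T F F F F pattern from the slide, repeated to fill 32 lanes.
divergent = [True, True, False, True, False, False, False, False] * 4
uniform = [True] * 32

# Hypothetical costs: Action_a takes 10 cycles, Action_b takes 7.
divergent_cost = warp_branch_cost(divergent, 10, 7)
uniform_cost = warp_branch_cost(uniform, 10, 7)
```

Under this model the divergent warp pays for both actions while the uniform warp pays for one, which is why a search procedure whose lanes follow the same control flow is attractive on a GPU.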

Algorithm (1)
- Decomposition-based approach
- Range-tree search
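
A minimal sketch of a per-field range search, under stated assumptions: the slides do not show how the range-trees are built, so this version stores the tree implicitly as a sorted array of interval start points and searches it by index arithmetic, with no pointers between nodes. `build_field_index` and `search` are hypothetical names, and bit i of each bit vector stands for rule i (0-indexed).

```python
def build_field_index(ranges):
    """Pre-process one field.

    ranges[i] is the inclusive (lo, hi) range of rule i for this field.
    Returns (starts, bvs): the sorted start points of the elementary
    intervals, and for each interval a bit vector whose bit i is set
    if rule i covers that interval.
    """
    points = {0}
    for lo, hi in ranges:
        points.add(lo)
        points.add(hi + 1)
    starts = sorted(points)
    bvs = []
    for s in starts:
        bv = 0
        for i, (lo, hi) in enumerate(ranges):
            # No range boundary lies strictly inside an elementary interval,
            # so testing its start point is enough.
            if lo <= s <= hi:
                bv |= 1 << i
        bvs.append(bv)
    return starts, bvs


def search(starts, value):
    """Index of the elementary interval containing value (value >= 0)."""
    lo, hi = 0, len(starts) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if starts[mid] <= value:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Destination-port ranges of the three example rules.
starts, bvs = build_field_index([(6888, 6888), (1604, 1704), (1024, 1024)])
bv_1650 = bvs[search(starts, 1650)]   # only rule index 1 covers port 1650
```

In a decomposition-based classifier one such search runs per field, and the per-field bit vectors are then combined.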

Algorithm (2)
Bit vector representation
- The i-th bit corresponds to the i-th original rule
  - 1: match
  - 0: not match
- Merge by bitwise AND operation
Figure: per-field bit vectors are ANDed together; bits that remain 1 mark the matching rules.
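
The merge step can be sketched with plain integers as bit vectors. The five per-field values below are made up for illustration, and `first_match` assumes that rules are ordered so that bit 0 is the highest-priority rule, which the slides do not specify.

```python
from functools import reduce


def merge(bit_vectors):
    """AND the per-field bit vectors; a bit survives only if every field matched."""
    return reduce(lambda a, b: a & b, bit_vectors)


def first_match(bv):
    """Index of the lowest set bit, or None if no rule matched.

    Assumption: bit 0 corresponds to the highest-priority rule.
    """
    if bv == 0:
        return None
    return (bv & -bv).bit_length() - 1


# Hypothetical results of five field searches over a 4-rule rule-set.
per_field = [0b1110, 0b0111, 0b0110, 0b1110, 0b0110]
merged = merge(per_field)      # rules 1 and 2 match on every field
winner = first_match(merged)
```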

Algorithm (3)
- 32 threads (a warp) per packet
- Pre-processing (on CPU):
  - Partition the rule-set into 32 subsets
  - Construct range-trees and BVs for each subset
- Classification (on GPU):
  - Phase 1: obtain an intermediate result (using the range-trees and BVs)
  - Phase 2: combine the intermediate results into the final result
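
The two-phase structure above can be sketched on the CPU by simulating the 32 lanes with a loop. This is only an outline under stated assumptions: rules are split round-robin (the slides do not say how the partition is chosen), phase 1 uses a plain scan of each subset as a stand-in for the range-tree and BV lookup, and a smaller priority number is taken to mean higher priority. All function names are hypothetical.

```python
K = 32  # lanes (threads) per packet, as on the slide


def partition(rules, k=K):
    """Pre-processing: round-robin split into k subsets (an assumption)."""
    return [rules[i::k] for i in range(k)]


def phase1(subset, packet):
    """One lane: best priority among the matching rules of its subset, else None.

    A plain scan stands in here for the range-tree + bit-vector lookup.
    """
    best = None
    for priority, ranges in subset:
        if all(lo <= v <= hi for v, (lo, hi) in zip(packet, ranges)):
            if best is None or priority < best:
                best = priority
    return best


def phase2(intermediate):
    """Reduce the per-lane intermediate results to one final result."""
    hits = [p for p in intermediate if p is not None]
    return min(hits) if hits else None


def classify(subsets, packet):
    return phase2([phase1(s, packet) for s in subsets])


# Synthetic 2-field rule-set: rule i has priority i and matches
# field 0 in [i, i + 10] and field 1 in [0, 100].
rules = [(i, [(i, i + 10), (0, 100)]) for i in range(64)]
subsets = partition(rules)
result = classify(subsets, (15, 50))   # rules 5..15 match; best priority is 5
```

On a GPU the list comprehension in `classify` corresponds to the 32 threads of a warp working in parallel, and `phase2` to a reduction across those threads.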

Architecture (figure)
Note: K = 32

Optimizations (1) (figure)

Optimizations (2)
- Store range-trees in shared memory
- Row-major layout: Data for Thread 1 | Data for Thread 2 | ... | Data for Thread n
- Result: shared memory bank conflicts

Optimizations (3)
- Minimize shared memory bank conflicts
- Column-major layout: the data for Thread 1, Thread 2, ..., Thread n is interleaved
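
Why the column-major layout helps can be checked with a small bank-index calculation. The sketch assumes 32 banks that are each one 4-byte word wide, 32 threads, and 32 words of data per thread; those sizes are chosen for illustration and are not taken from the slides. The helper names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32     # assumption: 32 banks, each one 4-byte word wide
NUM_THREADS = 32   # one warp
WORDS = 32         # assumption: words of data per thread


def conflict_degree(word_addresses):
    """Largest number of distinct words that one warp access maps to a single bank."""
    per_bank = Counter(addr % NUM_BANKS for addr in set(word_addresses))
    return max(per_bank.values())


def row_major(thread, i):
    """Thread t's data is contiguous: word i of thread t is at t * WORDS + i."""
    return thread * WORDS + i


def column_major(thread, i):
    """Data is interleaved: word i of thread t is at i * NUM_THREADS + t."""
    return i * NUM_THREADS + thread


# Each thread reads a different element of its own data, as happens when
# threads are at different nodes of their trees.
element = [t % 5 for t in range(NUM_THREADS)]

row_degree = conflict_degree([row_major(t, element[t]) for t in range(NUM_THREADS)])
col_degree = conflict_degree([column_major(t, element[t]) for t in range(NUM_THREADS)])
```

With these sizes the column-major address of thread t always falls in bank t, so the access is conflict-free whichever element each thread reads, while the row-major bank depends only on the element index and threads reading the same index collide.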

Platform
- CPU (Intel E5-2665)
  - Cores: 16
  - Frequency: 2.4 GHz
- GPU (NVIDIA K20 Kepler)
  - Streaming Multi-Processors (SMX): 13
  - CUDA cores: 2496
  - Frequency: 705.5 MHz

Performance (1)
Figure: latency (µs) for a 512-rule rule-set under three configurations: column-major (with shared memory), row-major (with shared memory), and without shared memory.

Performance (2)
Figure: throughput (MPPS) and latency (µs) versus number of rules (512, 1024, 2048, 4096).

Performance (3)
- Best case: smallest possible range-trees
- Worst case: largest possible range-trees
Figure: throughput (MPPS) and latency (µs) versus number of rules (512, 1024, 2048, 4096), for the best case and the worst case.

Conclusions
- Range-tree + BV packet classifier on GPU: 85 MPPS for a 512-rule rule-set
- Performance (throughput and latency) evaluated with respect to
  - Number of rules (512 to 4096)
  - Data layout in shared memory
- Compared to state-of-the-art multi-core implementation: 2x improvement in throughput

Future Work
- Hash-based packet classification algorithms
- Other networking applications using GPUs
  - Traffic classification
  - OpenFlow flow table lookup

Thank you!
Group webpage: http://ganges.usc.edu/wiki/parallel_computing
Email: shijiezh@usc.edu, singapur@usc.edu, prasanna@usc.edu