Towards High-performance Flow-level level Packet Processing on Multi-core Network Processors

Similar documents
Towards High-performance Flow-level Packet Processing on Multi-core Network Processors

Packet Classification and Pattern Matching Algorithms for High Performance Network Security Gateway

Packet Classification: From Theory to Practice

Towards Optimized Packet Classification Algorithms for Multi-Core Network Processors

Towards Effective Packet Classification. J. Li, Y. Qi, and B. Xu Network Security Lab RIIT, Tsinghua University Dec, 2005

Towards System-level Optimization for High Performance Unified Threat Management

Decision Forest: A Scalable Architecture for Flexible Flow Matching on FPGA

AN EFFICIENT HYBRID ALGORITHM FOR MULTIDIMENSIONAL PACKET CLASSIFICATION

Ruler: High-Speed Packet Matching and Rewriting on Network Processors

Performance Evaluation and Improvement of Algorithmic Approaches for Packet Classification

Packet Classification Algorithms: From Theory to Practice

Fast Packet Classification Algorithms

ECE697AA Lecture 21. Packet Classification

Multi-dimensional Packet Classification on FPGA: 100 Gbps and Beyond

DBS: A Bit-level Heuristic Packet Classification Algorithm for High Speed Network

SSA: A Power and Memory Efficient Scheme to Multi-Match Packet Classification. Fang Yu, T.V. Lakshman, Martin Austin Motoyama, Randy H.

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER

Problem Statement. Algorithm MinDPQ (contd.) Algorithm MinDPQ. Summary of Algorithm MinDPQ. Algorithm MinDPQ: Experimental Results.

Implementation of Boundary Cutting Algorithm Using Packet Classification

Reliably Scalable Name Prefix Lookup! Haowei Yuan and Patrick Crowley! Washington University in St. Louis!! ANCS 2015! 5/8/2015!

Architecture-Aware Session Lookup Design for Inline Deep Inspection on Network Processors *

On the Difficulty of Scalably Detecting Network Attacks

TOWARDS EFFECTIVE PACKET CLASSIFICATION

Forwarding and Routers : Computer Networking. Original IP Route Lookup. Outline

Switch and Router Design. Packet Processing Examples. Packet Processing Examples. Packet Processing Rate 12/14/2011

High-Speed Network Processors. EZchip Presentation - 1

Multi-core Implementation of Decomposition-based Packet Classification Algorithms 1

Decision Forest: A Scalable Architecture for Flexible Flow Matching on FPGA

MULTI-MATCH PACKET CLASSIFICATION BASED ON DISTRIBUTED HASHTABLE

Hardware Acceleration in Computer Networks. Jan Kořenek Conference IT4Innovations, Ostrava

Network Processors. Nevin Heintze Agere Systems

Application of SDN: Load Balancing & Traffic Engineering

LOAD SCHEDULING FOR FLOW-BASED PACKET PROCESSING ON MULTI-CORE NETWORK PROCESSORS

High-Performance Packet Classification Algorithm for Multithreaded IXP Network Processor

Housekeeping. Fall /5 CptS/EE 555 1

Scalable Packet Classification on FPGA

Scalable Enterprise Networks with Inexpensive Switches

FPX Architecture for a Dynamically Extensible Router

Three Different Designs for Packet Classification

Packet Classification. George Varghese

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

Expeditus: Congestion-Aware Load Balancing in Clos Data Center Networks

Implementation of Adaptive Buffer in Video Receivers Using Network Processor IXP 2400

Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li

15-744: Computer Networking. Routers

Embark: Securely Outsourcing Middleboxes to the Cloud

DESIGN AND EVALUATION OF PACKET CLASSIFICATION SYSTEMS ON MULTI-CORE ARCHITECTURE SHARIFUL HASAN SHAIKOT

Topic & Scope. Content: The course gives

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

Efficiency of Cache Mechanism for Network Processors

L0 L1 L2 L3 T0 T1 T2 T3. Eth1-4. Eth1-4. Eth1-2 Eth1-2 Eth1-2 Eth Eth3-4 Eth3-4 Eth3-4 Eth3-4.

High Performance Memory Opportunities in 2.5D Network Flow Processors

ITTC High-Performance Networking The University of Kansas EECS 881 Packet Switch I/O Processing

Packet Classification Algorithms: A Survey

Design of a High Speed FPGA-Based Classifier for Efficient Packet Classification

A Comparison of Performance and Accuracy of Measurement Algorithms in Software

Scalable Packet Classification on FPGA

Introduction CHAPTER 1

Efficient Packet Classification for Network Intrusion Detection using FPGA

Requirement Discussion of Flow-Based Flow Control(FFC)

A closer look at network structure:

QoS support for Intelligent Storage Devices

Media Access Control (MAC) Sub-layer and Ethernet

Extensible Network Security Services on Software Programmable Router OS. David Yau, Prem Gopalan, Seung Chul Han, Feng Liang

Dynamic Pipelining: Making IP- Lookup Truly Scalable

A Framework for Rule Processing in Reconfigurable Network Systems

Configurable Number of Simultaneous Packets per Flow

DevoFlow: Scaling Flow Management for High-Performance Networks

TOC: Switching & Forwarding

CA485 Ray Walshe Google File System

Hashing Round-down Prefixes for Rapid Packet Classification

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

TOC: Switching & Forwarding

Commercial Network Processors

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm

A 400Gbps Multi-Core Network Processor

Packet Classification Using Standard Access Control List

Selective Boundary Cutting For Packet Classification SOUMYA. K 1, CHANDRA SEKHAR. M 2

Hierarchically Aggregated Fair Queueing (HAFQ) for Per-flow Fair Bandwidth Allocation in High Speed Networks

Design of a Web Switch in a Reconfigurable Platform

Reliably Scalable Name Prefix Lookup

Homework 1 Solutions:

CS 268: Route Lookup and Packet Classification

ABC: Adaptive Binary Cuttings for Multidimensional Packet Classification

Last Lecture: Network Layer

Growth of the Internet Network capacity: A scarce resource Good Service

High-Performance Packet Classification on GPU

Scalable Packet Classification Using Interpreting. A Cross-platform Multi-core Solution

COSC 6385 Computer Architecture. - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors

Cisco ASR 1000 Series Aggregation Services Routers: QoS Architecture and Solutions

A Multi Gigabit FPGA-based 5-tuple classification system

Internet QoS 1. Integrated Service 2. Differentiated Service 3. Linux Traffic Control

Feature Rich Flow Monitoring with P4

Data Structures for Packet Classification

Netronome NFP: Theory of Operation

Graph Streaming Processor

Packet Classification for Core Routers: Is there an alternative to CAMs?

Design of a Multi-Dimensional Packet Classifier for Network Processors

Memory-Efficient 5D Packet Classification At 40 Gbps

Transcription:

Towards High-performance Flow-level level Packet Processing on Multi-core Network Processors Yaxuan Qi (presenter), Bo Xu, Fei He, Baohua Yang, Jianming Yu and Jun Li ANCS 2007, Orlando, USA

Outline Introduction Related Work on Multi-core NP Flow-level Packet Processing Flow Classification Flow State Management Per-flow Packet Ordering Summary

Outline Introduction Related Work on Multi-core NP Flow-level Packet Processing Flow Classification Flow State Management Per-flow Packet Ordering Summary

Introduction Why flow-level packet processing with high-performance? increasing sophistication of applications stateful firewalls deep inspection in IDS/IPS flow-based scheduling in load balancers continual growth of network bandwidth oc192 or higher link speed 1 million or more concurrent connections

Introduction Problems in flow-level packet processing: Flow classification: Importance: access control and protocol analysis difficulty: high-speed with modest memory Flow state management: Importance: stateful firewall and anti-dos Difficulty: fast update with large connections Per-flow packet order-preserving: Importance: content inspection Difficulty: mutual exclusion and workload distribution

Outline Introduction Related Work on Multi-core NP Flow-level Packet Processing Flow Classification Flow State Management Per-flow Packet Ordering Summary

Related Work on Multi-core NP Intel IXP2850

Related Work on Multi-core NP Programming Challenges: Achieving a deterministic bound on packet processing operation line rate constraint clock cycles to process the packet should have an upper bound Masking memory latency through multi-threading: memory latencies are typically much higher than the amount of processing budget Preserving packet order in spite of parallel processing: extremely critical for applications like media gateways and traffic management.

Outline Introduction Related Work on Multi-core NP Flow-level Packet Processing Flow Classification Flow State Management Per-flow Packet Ordering Summary

Flow Classification Related work: D. Srinivasan and W. Feng, Lucent Bit-Vector On the Intel IXP1200 NP Only support 512 rules D. Liu, B. Hua, X. Hu and X. Tang, Bitmap RFC Achieves near line speed on the Intel IXP2800. 100MB+ SRAM memory for thousands of rules Our study, Aggregated Cuttings (AggreCuts) Near line speed on IXP2850 Consumes less than 10MB SRAM

Flow Classification Algorithms Field-independent Search Algorithms Field-dependent Search Algorithms Trie-Based Algorithms Table-Based Algorithms Trie-Based Algorithms Decision-Tree Algorithms Bit-Map to store rules Bit-Map Aggregation BV CP Prefix Match Equivalent Match Index Search Binary Search No Back Tracking H-Trie SP-Trie Bit-Test Modular Single-Field Range-Test Multi-Field Folded Bit-Map Aggregation ABV AFBV Bit-Map Aggregation RFC B-RFC HSM No Rule Duplication Extend to Multiple Fields GoT EGT Bit-Map Aggregation HiCuts AggreCuts HyperCuts

Flow Classification Why not HiCuts? Non-deterministic worst-case search time Due to heuristics used for #cuts Excessive memory access due to linear search on leaf-nodes (8Rules, <3Gbps on IXP28xx) Our motivations: Fix the number of cuttings at internal-nodes: If the number of cuttings is fixed to 2 w, then a worst-case bound of O(W/w) is achieved (where W is header width, and w is stride) Eliminate linear search at leaf-nodes: Linear search can be eliminated if we keep cutting until every subspace is full-covered by a certain set of rules. Consider the common 5-tuple flow classification problem W=104, set w=8, then the worst-case search time 104/8=13 (nearly the same as RFC) No linear search is required

Flow Classification Space Aggregation

Flow Classification Data-structure Bits Description Value 31:30 dimension to d2c=00: src IP; d2c=01: dst IP; Cut (d2c) d2c=10: src port; d2c=11: dst port. 29:28 bit position to b2c=00: 31~24; b2c=01: 23~16 Cut (b2c) b2c=10: 15~8; b2c=11: 7~0 27:20 8-bit HABS if w=8, each bit represent 32 cuttings; if w=4, each bit represent 2 cuttings. 19:0 The minimum memory block is 2 w /8*4 20-bit Byte. So if w=8, 20-bit base address Next-Node support 128MB memory address space; CPA Base if w=4, it supports 8MB memory address address space.

Flow Classification Performance Evaluation Memory Usage: an order of magnitude Memory Access: 3~8 times less Throughput on IXP2850: 100000 HiCuts AggreCuts-4 AggreCuts-8 10000 3~5 times faster Memory Accesses (32-bit words) 100 90 80 70 60 50 40 30 20 10 0 10000 9000 8000 HiCuts AggreCuts-4 AggreCuts-8 SET01 SET02 SET03 SET04 SET05 SET06 SET07 Rule Sets 7000 Memory Usage (MB) 1000 100 Throughput (Mbps) 6000 5000 4000 3000 2000 HiCuts AggreCuts-4 AggreCuts-8 1000 10 SET01 SET02 SET03 SET04 SET05 SET06 SET07 Rule Sets 0 SET01 SET02 SET03 SET04 SET05 SET06 SET07 Rule Sets

Outline Introduction Related Work on Multi-core NP Flow-level Packet Processing Flow Classification Flow State Management Per-flow Packet Ordering Summary

Flow State Management Flow state management: Problem: A large number of updates over a short period of time Line speed update Solution: Hashing with exact match Collision and computation Our aim: Supporting large concurrent sessions with extremely low collision rate More than 10M session Less than 1% collision rate Achieving fast update speed using both SRAM and DRAM Near line speed update rate

Flow State Management Signature-based Hashing (SigHash) m signatures for m different states with same hash value Resolving collision in SRAM (fast, word-oriented) Storing states in DRAM (large, burstoriented)

Flow State Management Performance Evaluation Throughput 10Gbps Connections 10M Collision Less than 1% Depends on different load factors Exception rate Throughput (Gbps) 12 10 8 6 4 2 0 25.00% 20.00% 15.00% 10.00% 5.00% 0.00% Di r ect Hash Si ghash 8 16 24 32 40 48 56 64 Number of Threads 4 2 1 0.5 0.25 0.125 Load factor

Outline Introduction Related Work on Multi-core NP Flow-level Packet Processing Flow Classification Flow State Management Per-flow Packet Ordering Summary

Per-flow Packet Ordering Packet Order-preserving Typically, only required between packets on the same flow. External Packet Order-preserving (EPO) Sufficient for devices processing packets at network layer. Fine-grained workload distribution (packet-level) Need locking Internal Packet Order-preserving (IPO) Required by applications that process packets at semantic levels. Coarse-grained workload distribution (flow-level) Do not need locking

Per-flow Packet Ordering External Packet Order-preserving (EPO) Ordered-thread Execution Ordered critical section to read the packet handles off the scratch ring. The threads then process the packets, which may get out of order during packet processing. Another ordered critical section to write the packet handles to the next stage. Mutual Exclusion by Atomic Operation Packets belong to the same flow may be allocated to different threads to process Mutual exclusion can be implemented by locking. SRAM atomic instructions

Per-flow Packet Ordering Internal Packet Order-preserving (IPO) SRAM Q-Array Workload Allocation by CRC Hashing on Headers

Per-flow Packet Ordering Packet drop rate Throughput (Gbps) 12 10 8 6 4 2 0 0.12 0.1 0.08 0.06 0.04 IPO EPO 8 16 24 32 40 48 56 64 Number of Threads Queue Length=512 Queue Length=1024 Queue Length=2048 Performance Evaluation Throughput EPO is faster, 10Gbps IPO has linear speed up, 7Gbps Workload Allocation CRC is good (though, Zipf-like) While can be better 0.02 0 0 1 2 3 4 5 6 7 8 Time (sec)

Outline Introduction Related Work on Multi-core NP Flow-level Packet Processing Flow Classification Flow State Management Per-flow Packet Ordering Summary

Summary Contribution: An NP-optimized flow classification algorithm : Explicit worst-case search time: 9Gbps hierarchical bitmap space aggregation: upto 16:1 An efficient flow state management scheme: Fast update rate: 10Gbps Exploit memory hierarchy: 10M connection, low collision rate Two hardware-supported packet order-preserving schemes : EPO via ordered-thread execution: 10Gbps IPO via SRAM queue-array: 7Gbps Future work Adaptive decision tree algorithm on for different memory hierarchy? SRAM SYN-cookie for fast session creation? Flow-let workload distribution?

Thanks Questions?