Ruler: High-Speed Packet Matching and Rewriting on Network Processors


Tomáš Hrubý, Kees van Reeuwijk, Herbert Bos
Vrije Universiteit, Amsterdam / World45 Ltd.
ANCS 2007, December 3, 2007

Motivation: Why packet pattern matching?
- Protocol header inspection: IP forwarding, content-based routing and load balancing, bandwidth throttling, etc.
- Deep packet inspection: required by intrusion detection and prevention systems (IDPS). Inspecting the IP and TCP layer headers is not sufficient; the payload contains the malicious data.

Motivation: Why packet rewriting?
- Anonymization: we need to store traffic traces, network users are afraid of misuse of their data and identity, and ISPs want to protect their customers.
- Data reduction: the amount of data on the Internet is huge, and applications need only the data of interest to them. The data reduction must happen online!

Motivation: The Ruler goals
- a system for packet classification based on regular expressions
- a system for packet rewriting
- a system deployable at the network edge
- a system easily portable to other architectures
Ruler provides all of these!

Ruler, the language: The Ruler program

    filter udp header:(byte#12 0x800~2 byte#9 17 byte#2)
               address:(192 168 1 byte)
               tail:*
        => header 0#4 tail;

- A program (a filter) is made up of a set of rules.
- Each rule has the form: pattern => action;
- The action part of a rule is one of:
  - accept <number>
  - reject
  - a rewrite pattern (e.g., header 0#4 tail)
- Labels (e.g., header, address, tail) refer to sub-patterns.

Ruler, the language: Templates
- Frequently used patterns can be defined as templates:

    pattern Ethernet : (dst:byte#6 src:byte#6 proto:byte#2)

- Templates can use other templates to form more specific patterns:

    pattern Ethernet_IPv4 : Ethernet with [proto=0x0800~2]
    filter ether e:ethernet_ipv4 t:* => e with [src=0#6] t;

- A Ruler program can include files with templates:

    include "layouts.rli"

Ruler, the implementation: Parallel pattern matching
- A deterministic finite automaton (DFA) matches multiple patterns in parallel.
- State types: inspection, memory inspection, jump, tag, accept.
- Ruler remembers the positions of sub-patterns: a tagged DFA (TDFA).

    filter byte42 * 42 b:(byte 42) * => b;

- The position of label b is determined only at runtime.
- The DFA contains tag states that record the position in a tag table (a C sketch follows below).

[Figure: the TDFA for the filter above, with tag states recording the start and end positions of label b in the tag table.]
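
To make the tag table concrete, here is a minimal, self-contained C sketch for the byte42 filter above. It is not what the Ruler compiler emits (the real TDFA has more states, handles overlapping matches, and keeps matching after an accept); it only illustrates how tag writes record the position of b during the scan, so the packet never has to be re-parsed. All names are ours.

    #include <stdio.h>

    /* Tag-table slots for label b in: filter byte42 * 42 b:(byte 42) * => b; */
    enum { TAG_B_START, TAG_B_END, NUM_TAGS };

    /* DFA with two bits of state: was the byte two positions back a 42,
     * and was the previous byte a 42?  On reaching the accept state, the
     * tag writes record where sub-pattern b = (byte 42) starts and ends. */
    static int match_byte42(const unsigned char *pkt, size_t len,
                            int tags[NUM_TAGS])
    {
        int two_back_42 = 0, one_back_42 = 0;
        for (size_t pos = 0; pos < len; pos++) {
            int cur_42 = (pkt[pos] == 42);
            if (two_back_42 && cur_42) {          /* accept state           */
                tags[TAG_B_START] = (int)pos - 1; /* tag: b starts here     */
                tags[TAG_B_END]   = (int)pos + 1; /* tag: b ends after 42   */
                return 1;
            }
            two_back_42 = one_back_42;            /* DFA transition         */
            one_back_42 = cur_42;
        }
        return 0;
    }

    int main(void)
    {
        const unsigned char pkt[] = { 7, 42, 99, 42, 5 };
        int tags[NUM_TAGS] = { -1, -1 };
        if (match_byte42(pkt, sizeof pkt, tags))
            printf("b = pkt[%d..%d)\n", tags[TAG_B_START], tags[TAG_B_END]);
        return 0;
    }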

Network processors: Why is it so difficult to use NPUs?
- Parallelism: it is difficult to think in parallel, and NPUs employ various parallelism techniques: multiple execution units or threads, pipelines.
- Poor code portability: various C dialects, e.g.

    __declspec(shared gp_reg)    __declspec(sram)
    __declspec(shared scratch)   __declspec(dram_read_reg)

- Too many features to exploit, e.g., on the IXP2xxx:
  - a hierarchy of asynchronous memories (Scratch, SRAM, DRAM)
  - many cores (microengines, MEs) with hardware multi-threading
  - special instructions, atomic memory operations, queues, etc.

Network processors: Why use NPUs?
- Running on bare metal with minimal overhead.
- Embedded in routers, switches, and smart NICs.
- Worst-case guarantees:
  - number of available cycles
  - exact memory latency
  - no speculative execution or caching
- Hardware acceleration:
  - PHY integrated into the chip
  - hashing units
  - crypto units
  - CAM
  - fast queues

The implementation: Ruler on the IXP2xxx
- Dedicated RX and TX engines.
- All other engines execute up to 8 Ruler threads each.
- Only one thread per ME polls the RX queue, to save memory bandwidth and execution resources.
- Each thread independently processes a single packet (see the sketch below).
- Only the RX and TX queues synchronize the threads.
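
A minimal sketch of that per-thread structure in plain C. The queue and matcher primitives (rx_dequeue, tx_enqueue, ruler_match) are invented stand-ins, not the IXP microengine API; the stubs exist only so the sketch compiles and runs.

    #include <stdio.h>
    #include <stddef.h>

    /* Invented stand-ins for the RX/TX rings and the compiled matcher. */
    typedef struct { int id; } packet_t;

    static packet_t rx_ring[4] = { {1}, {2}, {3}, {4} };
    static size_t   rx_head;

    static packet_t *rx_dequeue(void)       /* NULL once the ring is empty */
    {
        return rx_head < 4 ? &rx_ring[rx_head++] : NULL;
    }

    static void tx_enqueue(packet_t *pkt)
    {
        printf("forwarding packet %d\n", pkt->id);
    }

    static int ruler_match(packet_t *pkt)   /* would run the compiled TDFA */
    {
        (void)pkt;
        return 1;
    }

    /* Per-thread loop: each thread owns one packet at a time and talks to
     * the other threads only through the RX and TX queues.  On the IXP,
     * only one thread per ME actually polls; its siblings are woken by a
     * hardware signal instead of spinning. */
    static void ruler_thread(void)
    {
        packet_t *pkt;
        while ((pkt = rx_dequeue()) != NULL) {
            if (ruler_match(pkt))           /* classify and/or rewrite */
                tx_enqueue(pkt);            /* else: drop              */
        }
    }

    int main(void) { ruler_thread(); return 0; }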

The implementation: Inspection states
- Inspection states are executed most often, so they need optimization the most.
- Reading the next byte of input:
  - no DRAM latency, thanks to prefetching (sketched below)
  - faster reads from positions known at compile time (headers)
  - bytes of no interest are skipped
- The multi-way branch that selects the transition to the next state:
  - has the largest impact on performance
  - the default branch is the one taken most frequently
  - we have two implementations: naive, and a binary tree with default-branch promotion
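
A sketch of the double-buffered reads behind the "no DRAM latency" point. The asynchronous-read and signal primitives are invented stand-ins for the NPU's split memory transactions, simulated here with a plain array so the sketch runs; the chunk size is our choice.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define CHUNK 8                       /* bytes per simulated DRAM burst */

    /* Simulated DRAM and asynchronous read; on the NPU the read returns
     * immediately and a signal fires once the data has landed. */
    static uint8_t dram[4 * CHUNK];

    static void dram_read_async(uint8_t *dst, size_t off)
    {
        memcpy(dst, dram + off, CHUNK);
    }
    static void wait_read_signal(void) { /* block until the burst lands */ }

    /* Double-buffered reader: while the matcher consumes one chunk, the
     * next one is already in flight, hiding the DRAM latency. */
    typedef struct {
        uint8_t buf[2][CHUNK];
        size_t  off;                      /* packet offset of current chunk */
        size_t  idx;                      /* cursor within current chunk    */
        int     cur;                      /* which buffer is current        */
    } reader_t;

    static void reader_init(reader_t *r)
    {
        memset(r, 0, sizeof *r);
        dram_read_async(r->buf[0], 0);      /* fetch the first chunk...     */
        dram_read_async(r->buf[1], CHUNK);  /* ...and prefetch the second   */
    }

    static uint8_t next_byte(reader_t *r)
    {
        if (r->idx == CHUNK) {            /* current chunk exhausted        */
            wait_read_signal();           /* prefetched chunk must be in    */
            r->cur ^= 1;
            r->off += CHUNK;
            r->idx  = 0;
            if (r->off + CHUNK < sizeof dram)  /* prefetch the chunk after  */
                dram_read_async(r->buf[r->cur ^ 1], r->off + CHUNK);
        }
        return r->buf[r->cur][r->idx++];
    }

    int main(void)
    {
        for (size_t i = 0; i < sizeof dram; i++) dram[i] = (uint8_t)i;
        reader_t r; reader_init(&r);
        for (int i = 0; i < 2 * CHUNK; i++) printf("%d ", next_byte(&r));
        printf("\n");
        return 0;
    }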

The implementation: Binary-tree switch statements
- Test for multiple values by checking single bits, one at a time (e.g., telling the classes 0-9, a-z, and A-Z apart from the remaining values below 64 and 128).
- We select the bit that puts most of the default values into one subtree.
- Testing a bit takes 1 cycle; taking the "jump" branch costs 3 extra cycles.
- We therefore make the subtree with more default values the fall-through branch.
- It is a heuristic. (A C rendering of the idea follows below.)
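
A C rendering of the idea, using the interesting byte values from the microcode on the next slide (47, 110, 112, 115, 117, 119); the particular bit order and the leaf compares are our choice, not the compiler's output. In microcode the win comes from making the default path the fall-through; a C compiler is free to reorder, so treat this purely as an illustration of the dispatch shape.

    #include <stdint.h>
    #include <stdio.h>

    enum state { S_DEFAULT, S_SLASH, S_N, S_P, S_S, S_U, S_W };

    /* Naive dispatch: up to six equality tests before the (frequent)
     * default case is known. */
    static enum state next_naive(uint8_t c)
    {
        switch (c) {
        case 47:  return S_SLASH;   /* '/' */
        case 110: return S_N;       /* 'n' */
        case 112: return S_P;       /* 'p' */
        case 115: return S_S;       /* 's' */
        case 117: return S_U;       /* 'u' */
        case 119: return S_W;       /* 'w' */
        default:  return S_DEFAULT;
        }
    }

    /* Binary-tree dispatch with default-branch promotion: single-bit
     * tests are ordered so that most non-matching bytes reach the
     * default after only one or two tests. */
    static enum state next_bittree(uint8_t c)
    {
        if (c < 47 || c > 119) return S_DEFAULT;        /* bulk of the bytes  */
        if (!(c & 0x40)) return c == 47 ? S_SLASH : S_DEFAULT; /* 47..63      */
        if (!(c & 0x20)) return S_DEFAULT;              /* 64..95: no matches */
        if (!(c & 0x10)) return c == 110 ? S_N : S_DEFAULT;    /* 96..111     */
        /* 112..119: the low three bits decide */
        if (!(c & 0x01)) return c == 112 ? S_P : S_DEFAULT;
        if (c & 0x02)    return (c & 0x04) ? S_W : S_S;        /* 119 / 115   */
        return (c & 0x04) ? S_U : S_DEFAULT;                   /* 117 / 113   */
    }

    int main(void)          /* check that both dispatchers agree */
    {
        for (int c = 0; c < 256; c++)
            if (next_naive((uint8_t)c) != next_bittree((uint8_t)c))
                printf("mismatch at %d\n", c);
        return 0;
    }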

The implementation: Naive vs. binary-tree switch statements

Naive:

    alu[--, act_char, -, 47]
    blt[STATE_20#]
    alu[--, act_char, -, 120]
    bge[STATE_20#]
    br=byte[act_char, 0, 47, STATE_24#]
    br=byte[act_char, 0, 110, STATE_26#]
    br=byte[act_char, 0, 112, STATE_23#]
    br=byte[act_char, 0, 115, STATE_33#]
    br=byte[act_char, 0, 117, STATE_22#]
    br=byte[act_char, 0, 119, STATE_21#]
    br[STATE_20#]

Binary tree:

    alu[--, act_char, -, 47]
    blt[STATE_20#]
    br_bclr[act_char, 5, STATE_20#]
    br_bclr[act_char, 0, BIT_BIN_33_31#]
    br_bset[act_char, 2, BIT_BIN_33_32#]
    br[STATE_20#]
    BIT_BIN_33_32#:
      br_bclr[act_char, 1, BIT_BIN_33_33#]
      br_bset[act_char, 3, BIT_BIN_33_34#]
      br_bset[act_char, 4, BIT_BIN_33_35#]
      br[STATE_20#]
    BIT_BIN_33_35#:
      ...

- The default branch is taken after 2 cycles (when bit 5 is not set), in contrast to 10 cycles in the naive version.
- Measured up to 10% overall speedup.

The implementation: Executed vs. interpreted states
- The instruction store is limited, so states are split into executed and interpreted ones.
- [Figure: simplified DFA; loop edges are omitted.]
- The number of states may explode exponentially.
- Experiments show that the hot states are few and lie close to the initial state.
- We move distant states, as well as states that are too expensive, to off-chip memory.
- The generated code includes stubs that start an interpreter, which reads transitions from a table in SRAM (sketched below).
- The compiler iterates until the code fits in the instruction store.

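A sketch of that executed/interpreted split. The transition-table layout and the names (sram_table, interpret_from) are invented; the real encoding is far more compact, and the real stub can jump back into compiled code when the walk re-enters a hot state.

    #include <stdint.h>
    #include <stddef.h>

    /* Invented SRAM layout: per interpreted state, 256 entries of
     * (next state, accept flag).  N_INTERP is a stand-in for Ruler's
     * real interpreted-state count. */
    enum { N_INTERP = 4 };
    typedef struct { uint16_t next; uint8_t accept; } trans_t;
    static trans_t sram_table[N_INTERP][256];

    /* Stub emitted in place of a state that was moved off-chip: hot,
     * executed states branch here, and the cold part of the automaton
     * is walked by table lookups instead of compiled code. */
    static int interpret_from(uint16_t state, const uint8_t *pkt, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            trans_t t = sram_table[state][pkt[i]];  /* one SRAM read per byte */
            if (t.accept)
                return 1;
            state = t.next;
        }
        return 0;
    }

    int main(void)
    {
        /* Toy automaton: each 42 advances one state; the transition out
         * of the next-to-last state accepts. */
        for (int s = 0; s < N_INTERP; s++)
            sram_table[s][42] = (trans_t){
                (uint16_t)(s + 1 < N_INTERP ? s + 1 : s),
                s == N_INTERP - 2
            };
        const uint8_t pkt[] = { 1, 42, 42, 42, 9 };
        return !interpret_from(0, pkt, sizeof pkt);   /* exits 0 on a match */
    }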

Evaluation: Limits of the IXP2400
- Clock cycles: 29 to 36 cycles are available per byte (for 1518-byte down to 64-byte Ethernet frames); a rough derivation follows below.
- Interpreted inspection states consume at most 35 cycles per byte.
- The IXP28xx has about 5.4x more cycles per byte available.
- Memory size:
  - instruction store: 4k instructions, enough for up to 200 states
  - SRAM: up to 32 MB, enough for up to 64k states
- Rewriting:
  - unaligned access to DRAM is expensive
  - the local memory used for constructing packets is fast but tiny
  - only a single thread per ME can do rewriting
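
A back-of-the-envelope check of that cycle budget, under our own assumptions: six 600 MHz microengines are left for matching (RX and TX occupy the other two), and each frame on a 1 Gbit/s line carries 20 bytes of framing overhead (preamble plus inter-frame gap). This lands close to the 29-36 range quoted above; the exact figures depend on the framing assumptions.

    #include <stdio.h>

    int main(void)
    {
        const double me_hz    = 600e6;    /* IXP2400 microengine clock      */
        const double n_mes    = 6;        /* MEs left after dedicated RX/TX */
        const double line_bps = 1e9;      /* gigabit line rate              */
        const double overhead = 20;       /* preamble + inter-frame gap (B) */
        const int sizes[] = { 64, 96, 546, 1518 };

        for (int i = 0; i < 4; i++) {
            double frames_per_s = line_bps / 8.0 / (sizes[i] + overhead);
            double bytes_per_s  = frames_per_s * sizes[i]; /* bytes to scan */
            printf("%5d-byte frames: %4.1f ME cycles per byte\n",
                   sizes[i], n_mes * me_hz / bytes_per_s);
        }
        return 0;
    }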

Evaluation: Benchmark filters

    filter   | states | instructions | insns/state | interpreted states
    ---------+--------+--------------+-------------+-------------------
    anon     |     19 |          641 |       30.05 |          -
    anonhdr  |     19 |          641 |       30.05 |          -
    backdoor |   2441 |        46041 |       18.83 |       2147
    large    |   2327 |        19216 |        8.23 |       2141
    payload  |     24 |          400 |       13.75 |          -
    null     |      1 |          145 |        6.00 |          -

Evaluation: Pattern-matching performance

Percentage of dropped (not processed) packets, by filter, packet size (offered load), and number of MEs:

                64 B (531.9 Mbit/s)      96 B (751.2 Mbit/s)     546 B (962.1 Mbit/s)    1518 B (990.8 Mbit/s)
    # of MEs:    1   2   3   4   5   6    1   2   3   4   5   6    1   2   3   4   5   6    1   2   3   4   5   6
    backdoor     0   0   0   0   0   0    0   0   0   0   0   0    0   0   0   0   0   0    0   0   0   0   0   0
    large       30   0   0   0   0   0   69  34   5   0   0   0   85  70  54  39  24   9   86  72  58  44  30  16
    payload      3   0   0   0   0   0   50   0   0   0   0   0   71  43  14   0   0   0   73  46  20   0   0   0
    null         0   0   0   0   0   0    0   0   0   0   0   0    0   0   0   0   0   0    0   0   0   0   0   0

[Figure: drop rate of the "large" filter vs. number of MEs (1-6) for 64 B, 96 B, 546 B, and 1518 B packets.]

Evaluation: Rewriting performance

Synthetic traffic, percentage of dropped (not processed) packets:

                64 B (531.9 Mbit/s)      96 B (751.2 Mbit/s)     546 B (962.1 Mbit/s)    1518 B (990.8 Mbit/s)
    # of MEs:    1   2   3   4   5   6    1   2   3   4   5   6    1   2   3   4   5   6    1   2   3   4   5   6
    anon        70  47  26  16   7   8   70  45  28  20  15  18   52  19   0   0   0   0   44   6   0   0   0   0
    anonhdr     55  18   3   0   0   0   52  17   4   0   0   0    0   0   0   0   0   0    0   0   0   0   0   0

Real traffic (average packet size 305.0 B, 829.0 Mbit/s):

    # of MEs:    1   2   3   4   5   6
    anon        78  37   2   0   0   0
    anonhdr      3   0   0   0   0   0

[Figure: drop rate of the anon filter vs. number of MEs (1-6) for 64 B, 96 B, 546 B, and 1518 B packets.]

Summary
- We developed Ruler, a language and a compiler.
- Ruler supports a wide range of architectures, including NPUs, FPGAs, and standard general-purpose CPUs.
- Ruler offers both pattern matching and packet rewriting.
- Ruler makes programming NPUs simple.
- Ruler is directly portable to current and upcoming multi-core chips, e.g., Niagara 1 and Niagara 2.
- We evaluated Ruler on real hardware, the Intel IXP2400.

Sponsors: EU FP6 LOBSTER project, Intel IXA University Program

Thank you for your attention. Questions?

Sponsors: EU FP6 LOBSTER project, Intel IXA University Program