IP Forwarding. CSU CS557, Spring 2018 Instructor: Lorenzo De Carli


Sources
George Varghese, Network Algorithmics, Morgan Kaufmann, December 2004
L. De Carli, Y. Pan, A. Kumar, C. Estan, K. Sankaralingam, PLUG: Flexible Lookup Modules for Rapid Deployment of New Protocols in High-speed Routers, SIGCOMM 2009

IP forwarding
Input: a destination IP address
Output: an output port, which defines the link on which the packet must be sent
[Diagram: a packet with IP.dst 129.82.103.121 arrives at a router with ports Port 1, Port 2, Port 3, ..., Port 16.]

IP forwarding - II
Internally, the router determines the output port by looking up a routing table:
  49.*      -> Port 3
  129.82.*  -> Port 2
  151.1.1.* -> Port 12
Example: IP.dst 129.82.103.121 matches 129.82.* and is forwarded to Port 2.

Longest prefix matching
There are 4B+ possible 32-bit IPv4 addresses (not quite, due to reserved/private address spaces, but still a lot)
There are 2^128 possible 128-bit IPv6 addresses
No data structure can possibly encode an individual (destination IP, port) pair for each possible destination

Longest prefix matching - II
What do we do then? Longest prefix matching.
Observation: IP addresses are not randomly assigned. Organizations receive blocks of IP address space.
Each block is defined by IP+mask, e.g. 129.82.103.0/24:
  129      82       103      0
  10000001 01010010 01100111 00000000
  (the /24 mask covers the first three octets)

Longest prefix matching - III
Organizations receive blocks of IP address space; each block is defined by IP+mask, e.g. 129.82.103.0/24.
Machines in this block will be assigned addresses with the 129.82.103 prefix, e.g., 129.82.103.1, 129.82.103.24, 129.82.103.11, etc.
Implication: IP addresses which share the same prefix also tend to share the same route.
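
To make the IP+mask notation concrete, here is a minimal Python sketch (the helper name in_block is hypothetical, not from the lecture) that tests whether an address belongs to a block such as 129.82.103.0/24 using integer masks:

```python
import ipaddress  # stdlib, used here only to parse dotted-quad strings

def in_block(addr: str, prefix: str, length: int) -> bool:
    """True if addr shares the first `length` bits of `prefix`."""
    a = int(ipaddress.IPv4Address(addr))
    p = int(ipaddress.IPv4Address(prefix))
    mask = ~((1 << (32 - length)) - 1) & 0xFFFFFFFF  # top `length` bits set
    return (a & mask) == (p & mask)

print(in_block("129.82.103.121", "129.82.103.0", 24))  # True
print(in_block("129.83.103.121", "129.82.103.0", 24))  # False
```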

Longest prefix matching - IV
[Diagram: route aggregation example. A packet with IP.dst 129.82.103.121 traverses routers holding progressively more specific entries: one router matches it against 129.* (block 129.0.0.0/8), another against 129.82.* (block 129.82.0.0/16), another against 129.82.103.* (block 129.82.103.0/24). Each single entry aggregates the routes for an entire block.]

Longest prefix matching - V
Why longest? Given a destination IP address, there may be multiple entries in the routing table with matching prefixes of different lengths.
In this case, the most specific (longest) prefix is chosen.

Longest prefix matching - VI
Longest prefix matching, general description:
1. Identify all rules whose prefix matches the destination IP address from the current packet
2. Among all the results, pick the rule with the longest prefix
[Diagram: a packet with IP.dst 129.82.103.121 is matched against a routing table containing 129.82.103.* -> Port 1, among other entries.]
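
The two steps above map directly onto a naive linear scan. A minimal Python sketch with a made-up three-entry table (illustrative only; as the next slide explains, real routers cannot afford this):

```python
import ipaddress

# Toy routing table: (prefix, prefix length, output port)
RULES = [("129.82.103.0", 24, 1),
         ("129.82.0.0",   16, 3),
         ("0.0.0.0",       0, 6)]  # default route

def lookup(addr: str) -> int:
    a = int(ipaddress.IPv4Address(addr))
    best_len, best_port = -1, None
    for prefix, length, port in RULES:
        p = int(ipaddress.IPv4Address(prefix))
        mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF if length else 0
        # Step 1: does this rule's prefix match the address?
        if (a & mask) == (p & mask) and length > best_len:
            # Step 2: keep the longest matching prefix seen so far
            best_len, best_port = length, port
    return best_port

print(lookup("129.82.103.121"))  # 1: the /24 beats the /16 and the default
```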

Longest prefix matching - VII
Problems with longest prefix matching:
Routing tables may contain hundreds of thousands of rules
Need to find all rules that match a certain IP, and pick the one with the longest prefix
Need to do it fast! (1 Tb/s with 512B packets: ~250M packets/s)
Impossible to use linear search
Routing tables are not really tables; we need an optimized data structure!

Efficient longest prefix matching
Router designers have looked at a number of approaches to efficient longest prefix matching:
Hardware-based approaches (specialized memory): TCAMs
Software-based approaches (SRAM + optimized data structures): tries, Lulea tries, tree bitmaps

TCAMs
TCAM: Ternary Content-Addressable Memory
Content-addressable memory: a hardware memory which is indexed by keys instead of memory addresses
Think of it as a hardware key-value store: given a key, it returns the corresponding value
Ternary: not all key bits need to be specified in each entry
Some bits can be set to "don't care"; those parts of the key always match

TCAM - example
Keys: 3 bits wide; values: 4 bits wide. Search key: 1 0 1
  Keys    Values
  1 * *   0 0 0 0
  0 1 1   0 0 0 1
  0 0 1   0 0 1 0
  1 0 *   0 0 1 1
  1 0 1   0 1 0 0
The search key matches the entries 1**, 10*, and 101.

How is this useful?
Route prefixes can be directly stored as keys; the part of the prefix which is not covered by the mask is set as "don't care"
Output ports are stored as values
Given a search key, among all matching entries pick the one with the fewest "don't care" bits
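
A small software model of this behavior (the entries and port names are made up for illustration): each entry stores a bit pattern plus a "care" mask, every entry is compared against the key (in parallel, in real hardware), and the winner is the matching entry with the fewest don't-care bits:

```python
# Each entry: (pattern, care, value). `care` has 1-bits where the key
# must match and 0-bits in "don't care" positions.
ENTRIES = [(0b100, 0b100, "Port A"),   # 1** -> Port A
           (0b101, 0b110, "Port B"),   # 10* -> Port B
           (0b101, 0b111, "Port C")]   # 101 -> Port C

def tcam_lookup(key):
    matches = [(care, value) for pattern, care, value in ENTRIES
               if (key & care) == (pattern & care)]
    if not matches:
        return None
    # Fewest don't-care bits == most 1-bits in the care mask.
    # For route prefixes, this is exactly the longest prefix.
    return max(matches, key=lambda m: bin(m[0]).count("1"))[1]

print(tcam_lookup(0b101))  # "Port C": all three match, 101 is most specific
```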

TCAM+LPM - Example
Search key: 129.82.103.121
  10000001 01010010 01100111 01111001
Routing table:
  129.82.102.* : 10000001 01010010 01100110 ********
  129.82.103.* : 10000001 01010010 01100111 ********
  129.82.*     : 10000001 01010010 ******** ********
The key matches 129.82.103.* and 129.82.*; 129.82.103.* wins (fewest don't-care bits).

TCAM - pros and cons
Advantages:
Fast! Can search or update in 1 memory cycle (~10ns)
Can handle large routing tables (~100,000 prefixes)
Disadvantages:
Area-hungry (12 transistors per bit, compared to the 6 transistors per bit required by regular SRAM)
Power-hungry (because of the parallel compare between the key and all entries)
Arbitration delay (~5 ns to pick the winning entry)
Requires a dedicated chip
Can we have an alternative approach that does not require specialized memory?

LPM - memory-based approaches
Easiest: use a generic data structure. Potentially slow, with unpredictable latency; most conventional structures are designed for exact matching.
Next: use a generic data structure + a cache. Problem: IP lookups have poor memory locality.
Why? Core routers process tens of thousands of network flows at the same time. Different flows have different destination IPs, which cause different memory locations to be accessed.

LPM - poor memory locality
[Diagram: a stream of lookups with interleaved destinations (129.82.103.121, 144.92.9.21, 130.192.181.193, 131.234.142.33, ...) hits scattered key/value entries in the routing table, defeating caching.]

Latency
Latency is also a problem! DRAM latency: ~30ns; SRAM latency: ~3ns
For large routing tables, SRAM may be too expensive, so we need to use DRAM
Slow, and caching does not help!

Take-away point
Generic data structures and memory hierarchies have poor performance on IP lookup operations (either too slow or too large)
Need specialized data structures for efficiency
We are going to see a few: tries, tree bitmaps

Lookup tries
Trie: a tree data structure used to store values associated with a set of strings
Each node represents a prefix of one or more strings in the set
Allows compact encoding, particularly when many strings share common prefixes

Lookup trie - example
Example: need to store the following keys and values: CSU: 3, Colorado: 15, Colostate: 11
[Diagram: trie with root edge 'C'; edge 'SU' leads to value 3; edge 'olo' branches into 'rado' (value 15) and 'state' (value 11).]
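
A minimal Python sketch of this trie (one character per edge rather than the compressed multi-character edges in the figure, to keep the code short):

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.value = None   # set on nodes where a key ends

def insert(root, key, value):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.value = value

root = TrieNode()
for key, value in [("CSU", 3), ("Colorado", 15), ("Colostate", 11)]:
    insert(root, key, value)

# "Colorado" and "Colostate" share the path C-o-l-o, stored only once
node = root
for ch in "Colostate":
    node = node.children[ch]
print(node.value)  # 11
```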

IP lookup tries
IP addresses are strings of bits, and many of them share common prefixes
Tries are ideally suited for the task!
Tries also help with longest prefix matching: lower-priority rules sit closer to the root; higher-priority rules sit closer to the leaves

Unibit tries
Simplest possible trie-based approach
Each node consumes 1 bit of the destination IP address
Each node which stores the last bit of a prefix in the routing table also stores a pointer to the corresponding port number

Unibit tries - example
Routing table:
  P1 = 101*     P2 = 111*     P3 = 11001*
  P4 = 1*       P5 = 0*       P6 = 1000*
  P7 = 100000*  P8 = 100*     P9 = 110*
[Diagram: unibit trie encoding the table; each edge consumes one bit, and a node ending a prefix stores the corresponding rule.]
From Varghese, Network Algorithmics

Unibit tries - example II
Need to look up 10000011 (using 8-bit addresses for simplicity)
[Diagram: the lookup walks the trie bit by bit, passing through the nodes for P4, P8, P6, and finally P7 before running out of matching branches.]
P7 wins over P8 (longest prefix)
From Varghese, Network Algorithmics
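
A sketch of a unibit trie over this routing table, with nodes as plain dicts and prefixes as bit strings for readability. The lookup remembers the last rule seen along the path, so when the walk ends the longest matching prefix has won:

```python
def insert(trie, prefix, rule):
    node = trie
    for bit in prefix:           # one trie level per bit
        node = node.setdefault(bit, {})
    node["rule"] = rule          # this node ends a prefix

def lookup(trie, addr_bits):
    node, best = trie, None
    for bit in addr_bits:
        if "rule" in node:
            best = node["rule"]  # longest prefix matched so far
        if bit not in node:
            return best          # no deeper match possible
        node = node[bit]
    return node.get("rule", best)

trie = {}
for prefix, rule in [("101", "P1"), ("111", "P2"), ("11001", "P3"),
                     ("1", "P4"), ("0", "P5"), ("1000", "P6"),
                     ("100000", "P7"), ("100", "P8"), ("110", "P9")]:
    insert(trie, prefix, rule)

print(lookup(trie, "10000011"))  # P7: beats P4, P8, and P6
```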

Considerations on unibit tries
Convenient for understanding, but not so useful in practice
Slow! Up to 32 memory accesses per lookup
Idea: improve efficiency by looking up multiple bits in each trie node

Multibit tries
1 bit per node is too slow! What about B > 1 bits per node? (Multibit tries)
The number of memory accesses is cut to 32/B (B is called the stride)
Advantage: faster (lower latency)
Disadvantage: what do we do with prefixes whose length is not a multiple of B? E.g. prefix 10101* with stride B = 3

Multibit tries - II
Problem: how can we support every prefix length with multibit tries?
Use a technique called prefix expansion: pad each prefix out to the next multiple of the stride by generating all possible completions
Simple, but generates entries even for prefixes that are not in the routing table

Prefix expansion example (stride B = 3)
Routing table: P1 = 101*, P2 = 111*, P3 = 11001*, P4 = 1*, P5 = 0*, P6 = 1000*, P7 = 100000*, P8 = 100*, P9 = 110*
Root node (first 3 bits):
  000 001 010 011 100 101 110 111
  P5  P5  P5  P5  P8  P1  P9  P2
Second-level node under 100:
  000 001 010 011 100 101 110 111
  P7  P6  P6  P6  -   -   -   -
Second-level node under 110:
  000 001 010 011 100 101 110 111
  -   -   P3  P3  -   -   -   -
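
A sketch of prefix expansion for a single stride-3 node, reproducing the root node above (expand_level is a made-up helper, not from the slides): each prefix is padded to the stride boundary by enumerating all completions, and longer prefixes overwrite shorter ones:

```python
from itertools import product

def expand_level(prefixes, stride):
    """prefixes: (bitstring, rule) pairs of length <= stride.
    Returns a dense node mapping each stride-bit index to a rule."""
    node = {}
    # Shortest prefixes first, so longer (more specific) ones overwrite
    for prefix, rule in sorted(prefixes, key=lambda p: len(p[0])):
        for tail in product("01", repeat=stride - len(prefix)):
            node[prefix + "".join(tail)] = rule
    return node

root = expand_level([("0", "P5"), ("1", "P4"), ("100", "P8"),
                     ("101", "P1"), ("110", "P9"), ("111", "P2")], 3)
for index in sorted(root):
    print(index, root[index])
# 000-011 -> P5, 100 -> P8, 101 -> P1, 110 -> P9, 111 -> P2
# P4 (1*) is fully shadowed by the longer prefixes, as in the slide
```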

Compressed tries: Lulea
Lulea algorithm (Degermark et al., 1997): due to prefix expansion, the same rule is oftentimes repeated multiple times in a node
Example:
  000 001 010 011
  P7  P6  P6  P6
Can compress using a bitmap + a compressed sequence. Bitmap: 1 represents a new rule, 0 the same rule as the previous entry
The example becomes bitmap 1100 and sequence P7, P6

Lulea - II
Why does this work?
Rules are large (even if the trie stores just an output port per prefix, we still need 16 bits per rule)
Bitmaps are small (1 bit per entry)
Instead of repeating the same rule multiple times, the rule is specified just once; the bitmap takes care of the rest!
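
A sketch of the compression and the corresponding lookup on the example node (function names are illustrative). The lookup counts the 1-bits in the bitmap up to and including its index (a popcount, a single instruction on most CPUs) to find which rule in the compressed sequence applies:

```python
def compress(entries):
    """['P7','P6','P6','P6'] -> ('1100', ['P7','P6'])."""
    bitmap, rules = "", []
    for i, rule in enumerate(entries):
        if i == 0 or rule != entries[i - 1]:
            bitmap += "1"        # a new rule starts here
            rules.append(rule)
        else:
            bitmap += "0"        # same rule as the previous entry
    return bitmap, rules

def lookup(bitmap, rules, index):
    # Number of 1-bits up to `index` = 1-based position in `rules`
    return rules[bitmap[:index + 1].count("1") - 1]

bitmap, rules = compress(["P7", "P6", "P6", "P6"])
print(bitmap, rules)             # 1100 ['P7', 'P6']
print(lookup(bitmap, rules, 3))  # P6: entry 011 still maps to rule #2
```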

How can we implement this?
Trie-based lookup is suitable for SW implementation, but still slow due to DRAM memory latency
For high-throughput applications, trie-based algorithms can be implemented using a specialized hardware pipeline
[Diagram: a multi-level lookup trie mapped onto a pipeline: each trie level is stored in its own memory (Memory 1-3), paired with its own processing element (1-3); lookups flow from Input through the processing elements to Output.]

PLUG
PLUG: Flexible Lookup Modules for Rapid Deployment of New Protocols in High-Speed Routers (L. De Carli et al., SIGCOMM 2009)

Background
Context: packet processing in high-speed routers
[Diagram: a high-end router with a control plane and line cards; each line card contains an interface, a CPU, the network function implementation, and DRAM.]
Requirements: real-time processing, high throughput (40 Gbps -> 100 Gbps -> 1 Tbps)

Background - II
What's the problem? We know how to build an efficient IP forwarding engine!
Well, IP lookup is not the only operation that routers need to perform: Ethernet lookup, packet classification, ...
Challenge: implementing all these operations efficiently within the same hardware

Background - III
Most operations performed by routers require time-consuming lookups in various specialized data structures: IP lookup using tries, Ethernet forwarding using hash tables, etc.
[Diagram: a line card with interface, CPU, DRAM, SRAM, and separate accelerators for IP forwarding, Ethernet forwarding, and regex matching.]
Typically implemented using ASICs: complicated to design, expensive, inflexible
Can they be replaced by a single module?

Background - IV
Idea: exploit commonalities between accelerators
Lookups are an orderly combination of steps; each step includes dedicated computation & memory, with predictable communication patterns
Data structures: Ethernet forwarding uses a hash table; IP forwarding uses a lookup trie
[Diagram: both a hash table (buckets of entries) and a multi-level lookup trie map onto the same HW design: a pipeline of memories, each paired with a processing element, from Input to Output.]

PLUG (Pipelined LookUp Grid)
Efficient execution of algorithmic steps with local data + computation
Algorithms (IP lookup etc.) are expressed as dataflow graphs (DFGs)
Massively parallel HW efficiently executes DFG applications
[Diagram: a lookup application is compiled into a DFG, which is then mapped onto tiled hardware with inputs and outputs.]

Why dataflow graphs?
A dataflow graph is a program representation based on data dependencies
Each node in the graph represents a small sequence of instructions; if the output of a node is used by one or more other nodes, the graph has edges going from that node to the ones that consume its output
Very well suited for expressing certain types of data structure lookups
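
A toy illustration of the idea in Python (this is not the PLUG programming model, just a sketch of dataflow-style execution): each node is a small function over its inputs, and an edge forwards a node's output to the node that consumes it, here a three-stage lookup chain:

```python
# Hypothetical 3-node DFG mirroring a pipelined trie lookup: each node
# consumes 2 bits and either resolves to a port or forwards the rest.
def level1(bits):
    return ("port", "P0") if bits[:2] != "10" else ("next", bits[2:])

def level2(bits):
    return ("port", "P2") if bits[:2] != "11" else ("next", bits[2:])

def level3(bits):
    return ("port", "P6" if bits == "01" else "P3")

CHAIN = [level1, level2, level3]  # a linear chain of DFG nodes

def run(bits):
    for node in CHAIN:
        kind, out = node(bits)
        if kind == "port":
            return out  # the result leaves the graph
        bits = out      # the edge carries the remaining bits onward
    return out

print(run("101101"))  # level1 -> level2 -> level3 -> P6
```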

Lookups as DFGs
[Diagram: Ethernet lookup (a hash table with 4 entries per bucket) expressed as a DFG: the input fans out to pages 1-4, one per entry slot, which converge on the output. IP forwarding (a multi-level lookup trie) expressed as a DFG: the input flows through pages 1-3, one per trie level, to the output.]

PLUG hardware
Massively parallel HW architecture
Matrix of elements (tiles), each including memory and computation, connected by an on-chip network of point-to-point links
DFG nodes map to tiles; DFG edges map to network links

Case study

IPv4 forwarding
Three-level multibit trie with 16/8/8 strides
3 pages per level: 2 for the rule array, 1 for children pointers
[Diagram: the trie levels mapped onto the PLUG tile grid.]
Throughput: 1G lookups/s; latency: 90 ns; power: 0.35 W

Some results
Applications: IPv4/IPv6 lookup (PLUG vs. NetLogic NL9000), regexp matching with DFAs (PLUG vs. Tarari T10)
[Charts: power, throughput, and latency of PLUG vs. the specialized architectures for the IPv4, IPv6, and DFA workloads.]

Other applications
Ethernet forwarding (hash table)
Packet classification (decision trees) (Vaish et al., ANCS 2011)
Ethane (precursor to OpenFlow; hash tables)
Seattle (DHT-based forwarding; hash tables + trees)
Alternative designs: LEAP (Harris et al., ANCS 2012); SWSL (Kim et al., ANCS 2013)

Student comments
What is the actual rate of protocol evolution?
Does PLUG provide a price and/or performance advantage in a complete, realistic system?
Is the evaluation performed with emulation or simulation?
A detailed description of the NPU changes is missing
A full-scale comparison with existing solutions is missing
Too broad a topic for a single paper

That's all for today
Next lecture: guest lecture on Named Data Networking
Paper review + attendance are required!