
Recent Results in Best Matching Prefix. George Varghese, October 16, 2001.

Router Model [figure: a packet with destination address 100100 arrives on an input link; the processor(s) consult the forwarding table and the message switch delivers it to output link 6. Forwarding table (prefix -> output link): 0* -> 1, 100* -> 6, 1* -> 2].

Three Bottlenecks for Router Forwarding Performance. There are many chores (e.g., checksum calculation, hop count update), but the major bottlenecks are: looking up IP destination addresses (best matching prefix) and classification; switching the packet to the output link; and scheduling (e.g., fair queuing) at the output (also accounting and possible security checks).

Routing Environment. Backbone routers: 100,000 prefixes, all lengths from 8 to 32, with density at 24 and 16 bits. Unstable BGP implementations may require route updates in 10 msec. May need programmability. Enterprise routers: around 1000 prefixes, route changes in seconds, bimodal (64, 1519). IPv6 plans heavy aggregation, but multihoming, anycast, etc., make large backbone databases likely.

Hardware or Software? RISC CPUs: expensive to make main memory accesses (say 50 nsec), cheaper to access the L1 or L2 cache. Network processors are now very common, mostly with multiple RISC cores and using multithreading, where each RISC core works on different packets to get around memory latency. Cheap forwarding engines (say $20 in volume). Fast (5-10 nsec) but expensive SRAM, and cheap but slower (50-100 nsec) DRAM (roughly six times cheaper per byte). Can use on-chip variants as well: on-chip SRAM of around 32 Mbits with 2-5 nsec access is available on custom chips, less on FPGAs.

Sample Database: P1 = 10*, P2 = 111*, P3 = 11001*, P4 = 1*, P5 = 0*, P6 = 1000*, P7 = 100000*, P8 = 1000000*.

Compressed Trie using Sample Database [figure: the unibit trie for the sample database, with nodes labeled P3 through P8; P5 sits on the 0 branch of the root and P4 on the 1 branch].

Skip Count versus Path Compression [figure: a chain of one-way branches leading to P3 is replaced either by a skip count of 2 (skip compressed) or by storing the skipped bits 00 (path compressed)]. Removing one-way branches ensures that the number of trie nodes is at most twice the number of prefixes. Using a skip count requires an exact match at the end and backtracking on failure; path compression is simpler.
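As an illustration, here is a minimal C sketch of longest-prefix lookup in a path-compressed unibit trie. The node layout (skipped_bits, skip_len, a two-way child array) is an assumption made for this sketch, not something specified in the talk.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical node of a path-compressed unibit trie: the bits removed
   from one-way branches are stored explicitly (1 <= skip_len < 32, total
   depth at most 32), so a failed match is detected without backtracking.  */
struct trie_node {
    uint32_t skipped_bits;        /* the bits skipped before this node     */
    unsigned skip_len;            /* how many bits were skipped            */
    int      prefix_nexthop;      /* next hop if a prefix ends here, or -1 */
    struct trie_node *child[2];   /* children for bit 0 and bit 1          */
};

/* Return the next hop of the longest matching prefix for addr, or -1. */
int trie_lookup(const struct trie_node *n, uint32_t addr)
{
    int best = -1;
    int bit = 31;                             /* next address bit to consume */

    while (n != NULL) {
        if (n->skip_len > 0) {
            /* Path compression: compare the stored skipped bits right away. */
            uint32_t got = (addr >> (bit - (int)n->skip_len + 1)) &
                           ((1u << n->skip_len) - 1u);
            if (got != n->skipped_bits)
                break;                        /* mismatch: keep best so far  */
            bit -= (int)n->skip_len;
        }
        if (n->prefix_nexthop >= 0)
            best = n->prefix_nexthop;         /* longest match seen so far   */
        if (bit < 0)
            break;                            /* address exhausted           */
        n = n->child[(addr >> bit) & 1u];
        bit--;
    }
    return best;
}

Because the skipped bits are stored and checked at each node, a mismatch simply ends the walk with the best match remembered so far; a skip-count representation would instead need a final exact match and backtracking.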

Binary Search and Prefix Lookup [figure: a sorted table of padded prefixes A000, AB00, ABC0, ABCF, ABFF, AFFF, with an address shown falling between two entries, where it "should go"]. The natural idea is to encode a prefix like A* by padding with 0's, i.e., A000. Problem: several different addresses, each of which should be mapped to a different prefix, can end up in the same range of the search table.

Modified Binary Search [figure: the sorted table A000, AB00, ABC0, ABCF, ABFF, AFFF, with example addresses ABCD, ABDD, and ADDD falling into distinct ranges]. Solution: encode a prefix like A as a range by inserting two values, A000 and AFFF, into the search table. Now when an identifier falls into a range, each range maps to a unique prefix. The mapping can be precomputed when the table is built.

Why Modified Search Works [figure: the sorted endpoints viewed as a sequence of low (L) and high (H) values]. Any range in the binary search table corresponds to the earliest L that is not followed by an H. Easy to precompute using a stack.

Modified Binary Search [figure: the search table of endpoints, each entry carrying a precomputed '>' pointer and '=' pointer identifying the matching prefix]. Need to handle the case of an exact match (= pointer) separately from the case where the key falls between two entries (> pointer).
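A minimal C sketch of the search step. The entry layout and the eq_prefix/gt_prefix fields are assumptions standing in for the precomputed '=' and '>' pointers, which are built when the table is constructed (e.g., with the stack walk described above).

#include <stdint.h>

/* Hypothetical table entry: each prefix contributes a low endpoint (padded
   with 0s) and a high endpoint (padded with 1s); eq_prefix and gt_prefix
   are the precomputed "=" and ">" pointers, built when the table is made.  */
struct range_entry {
    uint32_t key;        /* endpoint value                                   */
    int      eq_prefix;  /* best prefix if the address equals key            */
    int      gt_prefix;  /* best prefix if the address falls after key and
                            before the next entry                            */
};

/* Return the index of the best matching prefix for addr, or -1. */
int range_lookup(const struct range_entry *t, int n, uint32_t addr)
{
    int lo = 0, hi = n - 1, match = -1;

    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (addr == t[mid].key)
            return t[mid].eq_prefix;     /* exact hit: use the "=" pointer   */
        if (addr > t[mid].key) {
            match = t[mid].gt_prefix;    /* addr lies to the right of mid    */
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return match;    /* ">" pointer of the largest endpoint below addr       */
}

The only lookup-time difference from plain binary search is that an exact hit returns the = pointer, while falling between entries returns the > pointer of the largest entry below the address.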

Controlled Prefix Expansion. Reduce a set of prefixes with N distinct lengths to equivalent prefixes with M < N distinct lengths. Recursively apply two simple ideas. Expansion: expand a prefix P of length L into two equivalent prefixes P0 and P1 of length L+1. Capture: when a prefix expanded from level L to L+1 has the same value as a prefix at level L+1, discard the expanded prefix.

Expansion Algorithm [figure: an array indexed by prefix length 0..W (W = max prefix length); slot L holds all prefixes of length L]. Place all prefixes in a table by length. Pick expansion levels L_i. Start expanding from length 1 upward, skipping the expansion levels. At step i we expand the prefixes at level i to level i+1, taking capture into account.
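A rough C sketch of this bottom-up expansion with capture, assuming a small array-based table bucketed by length; the names (expand_one_level, keep[], the fixed-size arrays) are illustrative only and there are no overflow checks.

#include <stdint.h>

#define MAXLEN 32
#define MAXPFX 1024

/* Hypothetical prefix record: value holds the prefix bits left-aligned in a
   32-bit word, len is the prefix length, nexthop its next hop.            */
struct prefix { uint32_t value; int len; int nexthop; };

/* Prefixes bucketed by length; the caller fills these in (a sketch with
   fixed-size arrays and no overflow checks).                              */
static struct prefix table[MAXLEN + 1][MAXPFX];
static int count[MAXLEN + 1];

static int present(int len, uint32_t value)
{
    for (int i = 0; i < count[len]; i++)
        if (table[len][i].value == value)
            return 1;
    return 0;
}

/* Expand every prefix at length len into two prefixes at length len+1,
   applying capture: an expanded copy that collides with a prefix already
   at len+1 is discarded, since that prefix is more specific.              */
static void expand_one_level(int len)
{
    for (int i = 0; i < count[len]; i++) {
        struct prefix p = table[len][i];
        for (uint32_t bit = 0; bit <= 1; bit++) {
            uint32_t v = p.value | (bit << (31 - len));
            if (!present(len + 1, v))
                table[len + 1][count[len + 1]++] =
                    (struct prefix){ v, len + 1, p.nexthop };
        }
    }
    count[len] = 0;   /* this length is now empty: only chosen lengths remain */
}

/* keep[len] is nonzero for the chosen expansion levels; every other length
   is expanded upward, from the shortest length toward MAXLEN.              */
void controlled_expansion(const int keep[MAXLEN + 1])
{
    for (int len = 0; len < MAXLEN; len++)
        if (!keep[len])
            expand_one_level(len);
}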

Expanded Version of Trie [figure: the sample database expanded into a multibit trie with 2- and 3-bit strides; the array entries hold P3 through P8]. Reduces the worst-case search time from 7 memory accesses to 3 accesses.
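A minimal C sketch of lookup in an expanded (multibit) trie; the node layout is an assumption, and strides along any root-to-leaf path are assumed to sum to at most 32 bits.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical expanded-trie node: each node consumes a fixed stride of
   address bits and holds 2^stride entries; strides along any path are
   assumed to sum to at most 32.                                            */
struct mb_node {
    unsigned stride;                  /* bits consumed at this node          */
    const struct mb_node **child;     /* 2^stride children (NULL if none)    */
    const int *nexthop;               /* 2^stride next hops (-1 if none)     */
};

/* One memory access per trie level instead of one per address bit. */
int mb_lookup(const struct mb_node *n, uint32_t addr)
{
    int best = -1;
    int bits_left = 32;               /* address bits not yet consumed       */

    while (n != NULL && bits_left > 0) {
        bits_left -= (int)n->stride;
        uint32_t idx = (addr >> bits_left) & ((1u << n->stride) - 1u);
        if (n->nexthop[idx] >= 0)
            best = n->nexthop[idx];   /* remember the best match so far      */
        n = n->child[idx];
    }
    return best;
}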

Optimal Expanded Tries [figure: a root node of stride s covering subtrees T_1 through T_y rooted at trie level s+1]. Cost of the root = 2^s; total cost = 2^s plus the sum of the costs of covering T_1 through T_y using k-1 levels. Pick stride s for the root and recursively solve the covering problem for the subtrees rooted at trie level s+1 using k-1 levels. Choose strides so that update times stay bounded (say < 1 msec).

Stanford Algorithm. Observes that most prefixes are 24 bits or less and that DRAM with 2^24 locations will be widely available soon. Suggests starting with a 24-bit initial array lookup. To handle longer prefixes they use a 24-8-8 multibit trie. A problem is updating up to 2^24 locations when a prefix changes; they suggest hardware techniques for parallel updates.
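A hedged C sketch of the lookup side, simplified to two levels (24 bits, then 8); the flag-bit encoding of first-level entries and the table names are assumptions for illustration, not the exact published layout.

#include <stdint.h>

/* A simplified two-level (24-bit, then 8-bit) direct-lookup table.  The
   top bit of a first-level entry is assumed to flag whether it holds a
   next hop directly or an index into a pool of 256-entry second-level
   chunks; both tables are built offline from the prefix database.        */
#define TBL24_SIZE  (1u << 24)
#define CHUNK_SIZE  256u

extern uint16_t tbl24[TBL24_SIZE];   /* indexed by the top 24 address bits */
extern uint8_t  tbl_long[];          /* concatenated 256-entry chunks      */

static inline uint8_t ip_lookup(uint32_t addr)
{
    uint16_t e = tbl24[addr >> 8];          /* first memory access          */
    if ((e & 0x8000) == 0)
        return (uint8_t)e;                  /* prefix of <= 24 bits: done   */
    uint32_t chunk = e & 0x7fff;            /* prefix longer than 24 bits:  */
    return tbl_long[chunk * CHUNK_SIZE + (addr & 0xff)];  /* second access  */
}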

Level Compressed Tries [figure: the 1-bit trie for the sample database and its level-compressed version, in which full subtries are collapsed into single multibit nodes]. Forms a variable-stride trie by decomposing the 1-bit trie into full subtries. Variable-stride tries are more flexible (e.g., 8,2,2 versus 4,2,4), but the full-subtrie decomposition leads to less efficient use of memory and no tunability (e.g., memory cannot be traded for speed).

Array Representation of LC Tries [figure: the LC trie nodes laid out in a single array, level by level]. The array layout (by trie level) slows down insertion and deletion. Greater trie height (average 6) and skip-count compression (backtracking) lead to slower search times.

Lulea Scheme: Node Compression of Multibit Tries. Example: P1 = 0*, P2 = 11*, P3 = 10000* [figure: a 3-bit-stride array indexed 000-111 whose entries repeat values (default, pointer, ...), together with a bitmap marking where a new value starts]. Expansion can lead to a lot of filled entries in each multibit trie array. If we (effectively) do a linear search for the earliest array entry with a valid prefix, we can use bitmap compression.

Why Lulea Compression is Effective [figure: the prefixes viewed as ranges on the address line, with breakpoints where the best matching prefix changes]. If we consider each prefix as a range, the number of breakpoints in the function that maps indices to prefixes is at most twice the number of prefixes. This is also the basis for binary search on prefixes (Lampson et al., Infocom 98).

Why Lulea Speed Loss is not too bad [figure: a long bitmap split into 8-bit words; stored with word J is NumSet[J], the number of set bits in earlier words, so an uncompressed index maps to the compressed index NumSet[J] + the count of set bits within word J]. For large arrays of bits we store with each word J the value NumSet[J], the number of bits set in words before J. The high-order bits of the index yield J, and counting the bits set in word J gives the compressed index. Takes 3 memory accesses in software and 2 in hardware.
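A small C sketch of the compressed-index computation. The node layout, the use of 32-bit bitmap words, and __builtin_popcount (GCC/Clang) are assumptions for illustration; the slide's example uses 8-bit words, but the arithmetic is the same.

#include <stdint.h>

/* Hypothetical compressed-array node: bitmap[] marks which expanded trie
   entries begin a new value; base[w] holds the number of set bits in all
   words before word w (the NumSet array of the slide); results[] stores
   only the distinct values, one slot per set bit.  Bit 0 of the bitmap is
   assumed to be set, so every index has a value at or before it.          */
struct compressed_node {
    uint32_t       bitmap[8];   /* 256 expanded entries -> 8 32-bit words   */
    uint16_t       base[8];     /* set bits in words 0 .. w-1               */
    const uint16_t *results;    /* one entry per set bit                    */
};

/* Count the set bits among the low n bits of x (0 <= n <= 32). */
static inline unsigned popcount_low(uint32_t x, unsigned n)
{
    return n ? (unsigned)__builtin_popcount(x << (32 - n)) : 0;
}

/* Map an uncompressed index (0..255) to its stored value: the value of a
   cleared bit is the one belonging to the nearest set bit at or before it,
   which is exactly what the cumulative count produces.                     */
static inline uint16_t lookup_compressed(const struct compressed_node *n,
                                         unsigned index)
{
    unsigned word = index >> 5;          /* which 32-bit bitmap word        */
    unsigned upto = (index & 31) + 1;    /* include the bit at 'index'      */
    unsigned pos  = n->base[word] + popcount_low(n->bitmap[word], upto);
    return n->results[pos - 1];          /* pos counts set bits <= index    */
}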

Wash U Scheme: Binary Search on Possible Prefix Lengths. A naive search tries all possible prefix lengths (W exact matches for W possible lengths); we want to do binary search (log2 W exact matches). This is different from binary search on prefix ranges, which would take log2 N memory accesses (worse for large N). Need to add markers to guide the binary search and precompute the best matching prefixes for these markers.

Why Markers are Needed [figure: the sample prefixes placed in per-length tables, Length 1 through Length 7: P5=0* and P4=1* at length 1, P1=10* at length 2, P2=111* at length 3, P6=1000* at length 4, P3=11001* at length 5, P7=100000* at length 6, and P8=1000000* at length 7, each annotated with its bmp].

How backtracking can occur [figure: the same per-length tables with two marker entries added, 11 at length 2 and 1100 at length 4]. Markers announce "possibly better info on the right, no guarantees"; this can lead to a wild goose chase.

Avoiding Backtracking by Precomputation. We precompute the best matching prefix of each marker as Marker.bmp. When we get a match with a marker, we do search to the right, but we remember the bmp of the latest marker we have encountered.

Final Database [figure: the per-length tables with every prefix and marker carrying a precomputed bmp: P5=0* and P4=1* at length 1; P1=10* and the marker 11 (bmp=P4) at length 2; P2=111* at length 3; P6=1000* and the marker 1100 (bmp=P4) at length 4; P3=11001* at length 5; P7=100000* at length 6; P8=1000000* at length 7].
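A C sketch of the marker-guided binary search over prefix lengths. The per-length probe() interface and the entry fields are assumptions; the per-length hash tables themselves are not shown.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-length table interface: probe(len, addr) returns the
   entry stored under the first len bits of addr, or NULL if neither a
   prefix nor a marker of that length exists.  The hash tables themselves
   are not shown.                                                          */
struct entry {
    int is_prefix;   /* a real prefix ends here                            */
    int is_marker;   /* a marker: a longer matching prefix may exist       */
    int bmp;         /* precomputed best matching prefix for a marker      */
    int nexthop;     /* next hop if is_prefix                              */
};
const struct entry *probe(unsigned len, uint32_t addr);

/* Binary search over prefix lengths 1..32: about log2(W) probes instead
   of one probe per possible length.                                       */
int lookup_bmp(uint32_t addr)
{
    int lo = 1, hi = 32, best = -1;

    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        const struct entry *e = probe((unsigned)mid, addr);
        if (e == NULL) {
            hi = mid - 1;                  /* nothing this long: go shorter */
        } else {
            best = e->is_prefix ? e->nexthop : e->bmp;
            if (e->is_marker)
                lo = mid + 1;              /* longer matches may exist      */
            else
                break;                     /* plain prefix: nothing longer  */
        }
    }
    return best;                           /* -1 if no prefix matched       */
}

Because every marker carries a precomputed bmp, a probe that succeeds on a marker can safely move to longer lengths: if nothing better is found there, the remembered bmp is already the answer, so no backtracking is needed.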

Eatherton Scheme: Compression alternative to Lulea. Example: P1 = 0*, P2 = 11*, P3 = 10000* [figure: the same 3-bit-stride array and bitmap as in the Lulea example]. Lulea does leaf pushing and uses huge arrays (16 bits), which are hard to update. Instead, Eatherton does no leaf pushing and uses small arrays (no larger than 8 bits, even smaller). He encodes pointers and next hops in two different arrays using two different bitmaps, and gets around the use of smaller strides by accessing the bitmaps from the previous node and accessing all bitmaps in one access: 1 memory access per node compared with 2 for Lulea, so he can afford half the stride length. Lots of tricks reduce the memory access width; for example, the access to the next-hop pointer can be done lazily at the end. Described in Eatherton's M.S. thesis in the UCSD CSE and EE departments. Parts of the thesis may be patented by Cisco, as Eatherton joined there.
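A speculative C sketch of a tree-bitmap-style node with a 3-bit stride, two bitmaps, and popcount-based indexing, in the spirit described above; the exact field layout, bit ordering, and result-array organization are assumptions, not Eatherton's encoding.

#include <stdint.h>

/* Hypothetical tree-bitmap-style node with a 3-bit stride: the internal
   bitmap has one bit per prefix of length 0..2 inside the node (7 bits),
   the external bitmap one bit per possible child (8 bits).  Children and
   next-hop records live in contiguous arrays, so a popcount over the
   relevant bitmap gives the offset of the one we want.                    */
struct tb_node {
    uint8_t  internal;     /* prefixes stored within the node              */
    uint8_t  external;     /* which of the 8 children exist                */
    uint32_t child_base;   /* index of this node's first child             */
    uint32_t result_base;  /* index of this node's first next-hop record   */
};

static inline unsigned popcount8_below(uint8_t bits, unsigned pos)
{
    return (unsigned)__builtin_popcount(bits & ((1u << pos) - 1));
}

/* Index of the child selected by the 3 stride bits, or -1 if absent. */
static inline int tb_child(const struct tb_node *n, unsigned bits3)
{
    if (!(n->external & (1u << bits3)))
        return -1;
    return (int)(n->child_base + popcount8_below(n->external, bits3));
}

/* Longest prefix inside this node that matches the stride bits.  Assumed
   internal bit layout: bit 0 = "*", bits 1-2 = "0*","1*", bits 3-6 =
   "00*","01*","10*","11*".  Returns a next-hop record index, or -1.      */
static inline int tb_internal_match(const struct tb_node *n, unsigned bits3)
{
    for (int len = 2; len >= 0; len--) {
        unsigned idx = (1u << len) - 1u + (bits3 >> (3 - len));
        if (n->internal & (1u << idx))
            return (int)(n->result_base + popcount8_below(n->internal, idx));
    }
    return -1;
}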

2001 Conclusions. Prefix tables are getting very large; up to 500,000 entries seems required. Controlled prefix expansion works fine with DRAM; there are lots of licensees of the Wash U patent. Using RAMBUS-like technologies with network processors works fine. With SRAM-like technologies, we want to limit the use of fast memory. Can do simple algorithms and make them parallel, say by pipelining (the Juniper patent seems to use unibit tries?) or by using SRAM. Binary search is not too bad using B-trees with large width (uses memory well; can search in roughly log_W(N) time, where W is the number of prefixes that can be compared in one access). Lulea's memory use is remarkable; they are selling custom solutions (maybe more expensive) for their patent: www.enet.com. But updates are slow and it requires an L2 cache. Alternative bitmap compressions like Eatherton's may be effective, though patent coverage is very unclear. Wash U patents cover prefix expansion (ckense@msnotes.wustl.edu); only controlled expansion is mostly used, because of the non-determinism of binary search on prefix lengths. Tag switching is still alive and well for QoS (classification tags, really) in MPLS. Classification is mostly done today by CAMs and custom hardware (PMC Sierra's ClassiPI is an alternative). CAMs are still too small and power-hungry for lookups.