Computer Networks CS 552

Computer Networks CS 552: Routers
Badri Nath, Rutgers University, badri@cs.rutgers.edu

High-Speed Routers
Route lookups must keep up with aggregate capacity, e.g.:
Cisco 12016: 80 Gbps; Cisco 12416: 320 Gbps
Cisco 12816: 1,280 Gbps; power: 4.2 kW; cost: ~$500K
Juniper M320: 320 Gbps; power: 3.2 kW

What do routers do?
Routing: decide the next hop based on the destination address; cost varies with table size
Header modification: decrement TTL, rewrite the link-layer address for the next hop, etc.; requires rewriting the packet
Forwarding: byte movement — move bytes from the input interface to the output interface; must keep up with line-card speed

Basic Components of a Traditional High-Speed Router
Control plane: routing protocols build the routing table, which is distilled into the forwarding table
Datapath: switching and per-packet processing

Need for High-Speed Routers
Router bandwidth keeps increasing; line cards must be kept fully utilized.
The forwarding engine extracts the destination address from the packet header and consults the routing-lookup data structure (forwarding table) to choose the outgoing port, e.g.:

Dest-network    Outgoing port
65/8            ...
128.9/16        ...
149.12/19       3

Line rates and required lookup speeds:

Line     Line rate (Gbps)   40B pkts (Mpps)   Lookup time (ns)   84B pkts (Mpps)   354B pkts (Mpps)
OC-3     0.155              0.48              2,083              0.23              0.054
OC-12    0.622              1.94              515                0.92              0.22
OC-48    2.5                7.81              128                3.72              0.88
OC-192   10                 31.25             32                 14.88             3.53
OC-768   40                 125               8                  59.52             14.12

Hardware
DRAM access time: ~50 ns. Pricing (c. 2012 retail): 4 GB for $32; 2 GB for $20
In [Gupta 98], 16 MB for $50 was the price in 1998
SRAM access time: 5 to 10 ns. SRAM is 4 to 5 times more expensive than DRAM (roughly 1 GB ≈ $150; 16 MB ≈ $8)

First-Generation IP Routers
(Figure: CPU and buffer memory on a shared bus; each line card's MAC reaches memory via DMA.)
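The rates in the table above follow from simple arithmetic: packets per second is the line rate divided by the packet size in bits, and the lookup budget is the reciprocal. A small sketch (function names are illustrative):

```python
def pps(line_rate_gbps, pkt_bytes):
    """Packets per second a line must sustain at a given rate and packet size."""
    return line_rate_gbps * 1e9 / (pkt_bytes * 8)

def lookup_budget_ns(line_rate_gbps, pkt_bytes):
    """Worst-case time available per lookup, in nanoseconds."""
    return 1e9 / pps(line_rate_gbps, pkt_bytes)

# OC-48 at 2.5 Gbps with 40-byte packets: ~7.81 Mpps, so ~128 ns per lookup --
# already faster than a single ~50 ns DRAM access can comfortably sustain
# once several accesses per lookup are needed.
```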

First-Generation IP Routers
Shared memory; the bus is the bottleneck, and memory read/write speed is also a bottleneck
Every packet needs two transfers between the line cards and memory — it crosses the bus twice
Route table stored in DRAM
Does not scale to many line cards; suffices for low-speed routers (< 1 Gbps)

Second-Generation IP Routers
Each line card has a route-table cache (updated by the CPU)
On a hit, forward directly between line cards: the fast path
On a miss, go via the CPU, bus, and memory: the slow path
Copy only the header, then reconstruct the packet on the outbound link
Buffer packets on the cards
Up to ~5 Gbps
(Figure: CPU and buffer memory on the shared bus; each line card holds a local route cache.)

Third-Generation IP Routers
Line cards connected by a switching interface instead of a shared bus
Each card pairs a forwarding engine with transfer/switching units (TSU/FSU) and local buffers
[McKeown 97, Partridge 98]

Third-Generation IP Routers
Multiple forwarding engines; the IP header is stripped and handed to an FE
The header is processed separately from the body; the FE determines the outbound header
The packet is reconstructed and moved from the source buffer to the destination buffer
Exploit parallelism; keep a separate data-transfer path

Three Types of Switching Fabrics
(Typically shared memory, shared bus, and crossbar.)

Crossbar Arbitration
FIFO at each input; each head-of-line packet may request any of the N outputs
What about fairness? Rotating priority: top priority goes to the line following the one last serviced
Simple FIFO queues suffer HOL (head-of-line) blocking; virtual queues avoid it
Complex arbitration: N inputs requesting among N outputs gives N² request possibilities
Lots of arbiter schemes decide which of the requests from the input queues wins

High-Speed Routers
(Figure: Juniper Networks M-Series routers.)
Two multi-gigabit routers:
1. A 50-Gbps IP router, Craig Partridge et al., ACM/IEEE ToN, 1997 — specialized hardware, a 3rd-generation IP router, way ahead of its time
2. A 40-Gbps IP router, Sangjin Han et al., SIGCOMM 2010 — PacketShader, a software router on commodity hardware (CPU + GPU)
What has changed in 13 years?

Multi-Gigabit Router (MGR)
Separate switching backplane (fabric); distributed architecture
Multiple forwarding engines (FEs) determine the output line based on the header; each FE has its own forwarding table and buffer
Routing and forwarding separation: the FE decides which outbound line gets the packet
Only the header moves between line card and FE; packet deconstruction and reconstruction are done by the line cards
The line card uses the switching fabric to forward the packet; data moves from input to output line card via the switching backplane
Three processor roles: network processor (NP), FE processor, packet processor
Two memory banks to handle route updates:
Routing table handled by the NP — maintains several routes per destination D, and determines the active routes based on policy
Forwarding table handled by the FE processor — maintains only the active route to D; active routes installed for each destination
FE cache: 16 MB, divided into two 8-MB banks (used in active-standby mode)

Packet Processing
Line card: packet buffered in a FIFO queue; the header is removed and passed to the FE
FE: read the header, do the lookup, write the modified header
The modified header, with forwarding instructions, is sent back to the line card
Line card: buffers the entire packet for delivery to the output line
Fast-path code: header check, lookup, TTL and header update
42 cycles (each cycle 2.4 ns on the 415-MHz Alpha processor)
Fast-path time: ~101 ns; packet forwarding rate: 9.8 Mpps

Slow path (everything the fast path punts on):
Cache misses, header errors, headers with IP options, fragments, multicast

MGR Features
An FE is associated with each line card (but separate); it has its own 415-MHz Alpha processor and memory
FEs keep the entire routing table, as opposed to a cache
Switched backplane, as opposed to a shared bus
Switch transfer cycle ~0.38 µs; max 15 simultaneous transfers of 1 Kbit (approx. 50 Gbps)
Network processor: a 233-MHz Alpha, used for writing route updates into the FEs' routing tables — a control-plane function
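The fast-path numbers above are mutually consistent, which is worth checking once (constants are the slide's figures):

```python
# MGR fast-path arithmetic: 42 cycles on a 415-MHz Alpha give a ~100-ns
# fast path, i.e. just under 10 Mpps per forwarding engine.
CLOCK_HZ = 415e6
CYCLE_NS = 1e9 / CLOCK_HZ        # ~2.4 ns per cycle
FAST_PATH_CYCLES = 42            # header check, lookup, TTL/header update

fast_path_ns = FAST_PATH_CYCLES * CYCLE_NS
rate_mpps = 1e3 / fast_path_ns   # 1e9 ns/s / ns-per-packet / 1e6 = 1e3 / ns
```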

Switch arbitration Switch arbitration Cross bar with FIFO Each head of line packet contends for bus For fairness priority of input line rotated Multiple input buffers Row and column arbiter Diagonal arbiter Wrapped Diagonal arbiter Wave front arbiter On a N x N cross bar, grant request along the diagonal Priority given to higher level diagonals Rotate priority of diagonals 25 26 Forwarding speeds Lots of H/W Specialized H/W Network Processor Switching fabric Smart Forwarding (copy only header) 3 rd generation can match line card speeds MGR router: [Partridge 998] Some observations H/W units FE Processor Network Processor 998 22 45 Mhz 233 Mhz 2.66 GHZ ( Intel xeon x555) Level 2 cache (on-chip) 96KB 256 KB or 52 KB L Cache I-Cache D-Cache 8KB 8KB 64KB 64 KB Registers Switch Cores 32 Registers 5 Port Single 6 or 32 registers? Dual to multi core 27 28 7

What's Happening Now
Then: specialized hardware — MGR (1997)
Now: general-purpose hardware — PacketShader (SIGCOMM 2010), a GPU-based software high-speed router
Software-Defined Networking (SDN): a trend toward programmability on commodity H/W
Parallel computing available in GPUs: shader programs manipulate pixel values for scenes

Silicon Budget in CPU and GPU
Xeon X5550: 4 cores, 731M transistors (most of the die is cache and control)
GTX480: 480 cores, 3,200M transistors (most of the die is ALUs)

Per-Packet CPU Cycles for 10G
IPv4:  1,200 (packet I/O) + 600 (IPv4 lookup)            = 1,800 cycles
IPv6:  1,200 (packet I/O) + 1,600 (IPv6 lookup)          = 2,800 cycles
IPsec: 1,200 (packet I/O) + 5,400 (encryption, hashing)  = 6,600 cycles
Available budget: ~1,400 cycles (10G, minimum-sized packets, dual quad-core 2.66-GHz CPUs)

Our Approach 2: GPU Offloading
GPU offloading for memory-intensive or compute-intensive operations
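The ~1,400-cycle budget can be rederived from first principles. The 84-wire-byte figure (64-byte minimum Ethernet frame plus 20 bytes of preamble and inter-frame gap) is an assumption on my part; the slides only state the result:

```python
def cycle_budget(cores, clock_hz, line_rate_gbps=10.0, wire_bytes=84):
    """CPU cycles available per packet at a given line rate and wire size."""
    pps = line_rate_gbps * 1e9 / (wire_bytes * 8)   # ~14.88 Mpps at 10G
    return cores * clock_hz / pps

# Dual quad-core 2.66 GHz (8 cores) at 10G, min-sized packets -> ~1,430 cycles,
# i.e. IPv4's 1,800-cycle cost already exceeds the budget, let alone IPsec.
```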

PacketShader [SIGCOMM 2010]
A 40-Gbps router that uses commodity PCs
Exploits the parallelism of GPUs by doing lookups on a batch of packets
E.g., a Xeon CPU has 4 cores; a GTX480 GPU has 480 cores
Basic routing operations are offloaded to the GPU
Rendering a scene is a parallel operation (operations on pixels) — how is packet routing a parallel workload?

Key Idea
Process packets in batches from a large input buffer (the RX queue)
Each packet is handled by a separate core
Avoid H/W bottlenecks by partitioning: run packet-processing operations on independent cores
The key insight: stateless packet processing is parallelizable
1. Batching  2. Parallel processing in the GPU

Scaling with a Multi-Core CPU
A master core runs the shader pipeline (pre-shader, shader, post-shader); worker cores feed it through the device driver

Results (with 64B packets), CPU-only vs. CPU+GPU throughput (Gbps):
IPv4:     28.2 → 39.2  (1.4×)
IPv6:     8    → 38.2  (4.8×)
OpenFlow: 15.6 → 32    (2.1×)
IPsec:    3.2  → ~11   (3.5×)
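A minimal sketch of the batching idea, with CPU threads standing in for GPU cores; the names and structure are illustrative, not PacketShader's actual API. Because packet processing is stateless, the same lookup can run on every packet of a batch independently:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(packets, lookup, workers=4):
    """Drain one RX batch, split it across workers, and run the same
    stateless per-packet function on every packet in parallel."""
    chunks = [packets[i::workers] for i in range(workers)]  # round-robin split
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = pool.map(lambda c: [lookup(p) for p in c], chunks)
    return [r for chunk in per_chunk for r in chunk]
```

Note that the round-robin split interleaves results relative to the input order; a real router would restore per-flow packet ordering before transmission.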

Take-Away
Packet processing in routers is a parallel workload! Batch packets.
The same idea can be applied to other CPU-intensive networking operations:
Connection processing; HTTPS (SSLShader, NSDI 2011)

Computer Networks CS 552: Route Lookups
Badri Nath, Rutgers University, badri@cs.rutgers.edu

Why High-Speed Lookups?
Naïve lookup: keep one table entry per IP address (IP address → output port)
IPv4: 2^32 entries ≈ 4×10^9; IPv6: 2^128 entries ≈ 256×10^36
IPv4 would require 4G entries; memory cost in today's $: at ~$6/GB, about $24
Speed: ~50 ns for DRAM, ~5 ns for SRAM
But routes are advertised as prefixes: every prefix would need to be unwound, and the update cost of such a table is prohibitive [Gupta 98]

Classless Addressing
(Figure: the address space 0.0.0.0–255.255.255.255, contrasting class-based allocation with classless prefixes of varying lengths such as /8, /16, and /23.)

Number of Prefixes, Speed
The routing table contains prefixes; the size of the table is proportional to the number of prefixes
Is it small? No — prefixes keep increasing, so the routing table keeps growing
(Figures: number of active BGP prefixes over time, source: http://www.telstra.net/ops/bgptable.html; prefix-length distribution of the MAE-EAST routing table, source: www.merit.edu)

Lookup Algorithms [Gupta 98]
Software-based approaches: trie-based algorithms; binary search on tries and prefixes
Hardware-based approaches: route-lookup memory; content-addressable memory

Longest Prefix Match
With CIDR, route entries are prefixes <prefix, CIDR mask> and can be aggregated
We need to find the longest prefix that matches the destination address
Need to search prefixes of all lengths (in order), and among prefixes of the same length
E.g., 128.8.2/24 is more specific than 128.8/16: an address such as 128.8.2.x matches both, and the /24 entry wins
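Before the clever data structures, the LPM rule itself can be stated as a brute-force scan (this is the specification the later schemes speed up; names are illustrative):

```python
import ipaddress

def lpm(table, dst):
    """Longest-prefix match the slow way: examine every entry and keep
    the most specific prefix that contains the destination address."""
    addr = ipaddress.ip_address(dst)
    best_len, best_port = -1, None
    for prefix, port in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_port = net.prefixlen, port
    return best_port
```

This is O(N) per packet, which is exactly why the trie, Luleå, and hardware schemes below exist.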

Linear Search
Keep the N prefixes in a linked list: O(N) storage, O(N) lookup time, O(1) update complexity (add at the head of the list)
Arbitrary insertion and deletion are O(N)
Keeping the list sorted on prefix length improves the average time for operations

Tree Search
Simple binary tree: each left subtree has key values <= the root, each right subtree has key values >= the root; full key comparison at each node
Digital search tree: branch according to selected bits of the key — left branch on bit value 0, right branch on bit value 1; at each level i, check the i-th most significant bit

Trie
Same branching as the digital search tree, but only leaves store data
A trie node holds: next-hop pointer (if it ends a prefix), left pointer, right pointer
A leaf node has the next-hop information (if a prefix is found); depth-first descent, comparing one bit of the key per step
For fixed-length prefixes: O(W) lookup, where W is the prefix length (the height of the trie)
Storage: O(N) leaves plus O(N) internal nodes, where N is the number of prefixes

Radix Trie
Stores variable-length prefixes (keys); internal nodes can also hold prefixes
A node's prefix is the concatenation of all the bits on the path to it; compare bit i at level i
Lookup keeps track of the longest prefix seen so far
(Figure: example trie over prefixes P1–P4 of different lengths.)
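The binary (1-bit) trie described above is short enough to write out in full; prefixes are given as bit strings, and lookup remembers the best match seen on the way down (a sketch; names are illustrative):

```python
class TrieNode:
    __slots__ = ("children", "port")
    def __init__(self):
        self.children = [None, None]   # 0-branch, 1-branch
        self.port = None               # next hop if this node ends a prefix

def insert(root, bits, port):
    """Insert a prefix given as a bit string, e.g. '10' for 128.0.0.0/2."""
    node = root
    for b in bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.port = port

def lookup(root, addr_bits):
    """Walk one bit per level, remembering the longest prefix seen so far."""
    node, best = root, root.port
    for b in addr_bits:
        node = node.children[int(b)]
        if node is None:
            break                      # no more branches: best so far is the LPM
        if node.port is not None:
            best = node.port
    return best
```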

Trie Search
At each level, search the left or right subtree based on the next bit of the address
On visiting a node holding a prefix P, record BMP = P (best match so far)
The search ends when there are no more branches; then LPM = BMP
For N prefixes of up to W bits: O(W) lookup, O(NW) storage, O(W) update complexity
Storage is wasted on long one-child chains
Idea: compress branches with one child — the Patricia tree

Pat tree features Multi-bit Tries Pat tree is a complete binary tree (node has degree or 2) W-bit prefixes: Worst case O(W 2 ) lookup, O(W) update complexity N leaves and N- internal nodes Less storage Backtrack complexity Can be improved W W/k Binary trie Depth = W Degree = 2 Stride = bit Multi-ary trie Depth = W/k Degree = 2 k Stride = k bits 53 54 Prefix Expansion with Multi-bit Tries Four-ary Trie (k=2) If stride = k bits, prefix lengths that are not a multiple of k need to be expanded E.g., k = 2: Prefix * *, * * * Expanded prefixes Maximum number of expanded prefixes P * H P2 * H2 P3 * H3 P4 H4 corresponding to one non-expanded prefix = 2 k- 55 56 A B P2 A four-ary trie node next-hop-ptr (if prefix) ptr ptr D E F P3 P P 2 G ptr ptr P4 C P4 2 H Lookup 4

Luleå Algorithm: Motivation
Degermark et al., "Small forwarding tables for fast routing lookups," Proc. of SIGCOMM '97
Large routing tables: Patricia (NetBSD) and radix (4.4 BSD) trees spend ~24 bytes per leaf, giving tables of a couple of Mbytes; a naïve binary tree is huge and won't fit in fast CPU cache memory
Median routing table size: 40,000 entries in 1997, and growing steadily since
Memory accesses are the bottleneck of lookup
Goal: minimize memory accesses and the size of the data structure; design for up to 2^14 = 16K different next hops
Method: compress the radix trie using bit vectors

Luleå Algorithm: Three Levels
CIDR longest-prefix-match rule: a more specific entry e2 supersedes e1
Divide the complete binary tree over the 2^32-address IP space into three levels, cut at depths 16, 24, and 32:
Level 1: one big node representing the entire tree down to depth 16
Levels 2 and 3: chunks describing portions of the tree below
The binary tree is sparse, and most accesses fall into levels 1 and/or 2

Luleå Algorithm: Level 1
Covers all prefixes of length <= 16: cutting across the tree at depth 16 gives a bit vector of length 2^16
Bit encoding: root head = 1, genuine head = 1, member of a genuine head = 0
Divide the bit vector into 2^12 bit masks, each 16 bits long
Head information is stored in a packed pointer array: one 16-bit pointer per bit set in the bit masks
A pointer is 2 bits of type information plus 14 bits of index: genuine heads index into the next-hop table; root heads index into the array of Level 2 (L2) chunks
Problem: given an IP address, find its index into the pointer array

Luleå: Finding the Pointer Group
Group pointers by 16-bit bit mask; how many pointers must be skipped? (Recall: the bit vector is 2^16 bits in total.)
Code word array code (2^12 entries): one entry per 16-bit bit mask, indexed by the top 12 bits of the IP address
Each code word's 6-bit offset six counts the pointers to skip to reach the first pointer for that bit mask in the pointer array
Four bit masks can have up to 4 × 16 = 64 bits set, but 6 bits only count to 63, so the value may be too big — hence:
Base index array base (2^10 entries): one base index per four code words, counting the pointers to skip for those four bit masks; indexed by the top 10 bits of the IP address
Lookup: extract the top 10 bits of the address (bix) and the top 12 bits (ix); skip code[ix].six + base[bix] pointer groups in the pointer array

Luleå: Finding the Pointer in the Pointer Group
Let a(n) be the number of possible bit masks of length 2^n: a(0) = 1, a(n) = 1 + a(n−1)², and a(4) + 1 = 678
So maptable can be indexed with 10 bits; the ten field of the code word indexes maptable
maptable entries are 4-bit offsets; for each bit-mask pattern the cell values are fixed, so maptable is precomputed and constant — only the IP address varies with the tree
Summary of finding the pointer index: pix := base[bix] + code[ix].six + maptable[code[ix].ten][low 4 bits], where the low 4 bits of the 16-bit level-1 index select the bit within the mask
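The heart of the level-1 lookup — counting how many set bits precede a position, with per-mask totals precomputed — can be sketched as follows. This is a simplified sketch, not the paper's layout: Python ints stand in for packed 16-bit words, and the 4-bit maptable compression is replaced by a direct popcount; names are illustrative:

```python
def build_bases(masks):
    """For each 16-bit mask, precompute how many pointers all earlier masks
    consumed (the role Lulea's base/six fields play, without quantization)."""
    bases, total = [], 0
    for m in masks:
        bases.append(total)
        total += bin(m).count("1")
    return bases

def pointer_index(masks, bases, pos):
    """Index into the packed pointer array for bit 'pos' of the whole bit
    vector (bit 0 = MSB of mask 0). Assumes that bit is actually set."""
    word, bit = divmod(pos, 16)
    before = masks[word] >> (16 - bit) if bit else 0   # set bits left of pos
    return bases[word] + bin(before).count("1")
```

The pointer array stays densely packed (one slot per set bit), which is exactly where the scheme's space savings come from.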

Luleå Algorithm: Levels 2 and 3
Consist of chunks, pointed to by root heads; a chunk covers a subtree of height 8, so up to 256 heads
Three types of chunk:
Sparse (1–8 heads): an array of the 8-bit indices of the heads
Dense (9–64 heads): like Level 1, but with only one base index
Very dense (65–256 heads): same format as Level 1

Luleå: Summary
Trades mutability and table-construction time for speed: adding a routing entry requires rebuilding the entire table (but routing tables don't change often)
Bottom line — lookup: at most 8 memory references; table: ~150 Kbytes for 40,000 entries, i.e. 4–5 bytes per entry
Current state of the art in router IP lookup (at the time)
Open issue: scaling to IPv6 (128-bit addresses)

Hash Tables [Waldvogel 98]: Binary Search on Trie Levels
Store prefixes of different lengths in separate hash tables, chaining prefixes of the same length; the table array's size is O(number of distinct prefix lengths)
Naïve search: extract the largest number of bits and try to match; on a miss, decrease to the next length and repeat
Better: define a recursive (binary) search order over the lengths — search the middle table; on a match, search longer prefixes; on a miss, search shorter prefixes
Add markers to guide the search: markers are the longest sub-prefixes, placed in shorter-length bins, of entries in longer-length bins
At most log₂(W) hash lookups; scales to IPv6
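A sketch of binary search on prefix lengths, simplified in two ways from Waldvogel's scheme: markers are stored at every shorter length (not just along the binary-search path, so this stores more markers than necessary), and every stored entry carries its precomputed best-matching prefix so the search never backtracks. Names are illustrative:

```python
def build(prefixes):
    """prefixes: {bit-string: port}. One hash table per distinct length."""
    lengths = sorted({len(p) for p in prefixes})
    strings = set(prefixes)
    for p in prefixes:                  # leave markers at shorter lengths
        for L in lengths:
            if L < len(p):
                strings.add(p[:L])
    def best_match(s):                  # longest real prefix of s (incl. s)
        for L in range(len(s), 0, -1):
            if s[:L] in prefixes:
                return prefixes[s[:L]]
        return None
    tables = {L: {} for L in lengths}
    for s in strings:                   # real prefixes and markers alike
        tables[len(s)][s] = best_match(s)
    return lengths, tables

def lookup(addr, lengths, tables):
    lo, hi, best = 0, len(lengths) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = tables[lengths[mid]].get(addr[:lengths[mid]], "miss")
        if hit == "miss":
            hi = mid - 1                # nothing this long: only shorter can match
        else:
            if hit is not None:
                best = hit              # precomputed BMP along this path
            lo = mid + 1                # entry or marker found: try longer
    return best
```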

Route Lookups in Hardware [Gupta 98, Infocom]
Store prefixes directly in memory / high-speed cache
Storing all of IPv4 would take 4G entries, but most route entries are at most 24 bits long: 2^24 = 16M entries
1998 prices: ~$50 for the memory; today a gigabyte can be had for about $10
Store the 24-bit prefixes with next-hop information in one table; for longer prefixes, use a secondary table — the two-level page-table idea
First table (2^24 entries): indexed by the top 24 bits of the destination address; each entry holds either a next hop or a pointer (base) into the second table
Second table: indexed by base and the low 8 bits of the address (offset); holds next hops for prefixes longer than /24
Most lookups finish in one memory access (~50 ns)

Routing Lookups in H/W: Discussion
Memory is cheap; can achieve nanosecond-scale lookup times
Can improve the technique to fit in SRAM, depending on the prefix-length distribution
Update complexity:
Two memory banks (switch after each update), updating every affected entry; or update ranges, tagging each entry with its prefix length
Deleting one prefix touches many entries: a /16 covers 256 entries, a /8 covers 64K entries
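The two-table scheme can be sketched as follows. Assumptions: dicts stand in for the flat 2^24-entry DRAM array of the real design, routes are inserted shortest-prefix first so more specific entries overwrite, and all names are illustrative:

```python
def build(routes):
    """routes: (prefix, length, port) with prefix a zero-padded 32-bit int.
    tbl24 maps the top 24 bits to ('port', p) or ('second', key); tbl_long
    holds a 256-entry block per /24 that contains longer prefixes."""
    tbl24, tbl_long = {}, {}
    for prefix, plen, port in sorted(routes, key=lambda r: r[1]):  # short first
        if plen <= 24:
            base = prefix >> 8
            for i in range(1 << (24 - plen)):        # expand over covered slots
                if tbl24.get(base + i, ("port",))[0] != "second":
                    tbl24[base + i] = ("port", port)
        else:
            key = prefix >> 8
            if tbl24.get(key, ("port",))[0] != "second":
                default = tbl24.get(key, (None, None))[1]
                tbl_long[key] = [default] * 256      # inherit the /24's next hop
                tbl24[key] = ("second", key)
            low = prefix & 0xFF
            for i in range(1 << (32 - plen)):
                tbl_long[key][low + i] = port

    return tbl24, tbl_long

def lookup(addr, tbl24, tbl_long):
    entry = tbl24.get(addr >> 8)                     # first (usually only) access
    if entry is None:
        return None
    if entry[0] == "port":
        return entry[1]
    return tbl_long[entry[1]][addr & 0xFF]           # second access, > /24 only
```

Since few real prefixes are longer than /24, almost every lookup costs exactly one memory access, which is the whole point of burning 2^24 entries of cheap DRAM.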

Content-Addressable Memory (CAM)
Fully associative memory; TCAM = ternary CAM (0, 1, *)
Exact-match operation in a single clock cycle: a parallel compare across all entries
The content (destination address) is the key; the address where the content is stored is returned
~10× more expensive than DRAM
CAM is good for fixed-length data; TCAMs handle variable-length prefixes
Typical sizes 2–4 MB; high power consumption (~10 W)

Research: Technology Trends
Multicore processors: energy, work scheduling, parallelization — slow path on one core, fast path on another; DVFS
Virtualization: run more than one OS (RTOS, Linux, or BSD) — an RTOS for the fast path, Linux for the slow path
Solid-state drives; low-energy memory
SDN: more or less work for routers? More updates, more communication