Computer Networks CS 552: Routers
Badri Nath, Rutgers University, badri@cs.rutgers.edu

High Speed Routers: route lookups
- Example platforms: Cisco 12016 (80 Gbps), Cisco 12416 (320 Gbps), Cisco 12816 (1280 Gbps); power ~4.2 kW, cost ~$5K; Juniper M320: 320 Gbps, power ~3.2 kW

What do routers do? Basic components of a traditional high-speed router
- Routing: decide the next hop based on the destination address; cost varies with table size
- Header modification: decrement the TTL, set the link-layer address of the next hop, etc.; requires rewriting the header
- Forwarding (byte movement): move bytes from the input interface to the output interface; must keep up with line-card speed
- Control plane: routing protocols maintain the routing table, which feeds the forwarding table
- Datapath (per-packet processing): forwarding-table lookup and switching
Forwarding Engine / Need for high-speed routers
- Router bandwidth keeps increasing; we need to keep the line cards fully utilized
- For each packet, the destination address in the header is fed to a routing lookup data structure (the forwarding table), which returns the outgoing port
- Example forwarding-table entries (dest-network -> port): 65/8, 128.9/16, 149.12/19, each mapped to an outgoing port such as 3

Line rates and lookup budgets (back-to-back packets):
Line    Line rate (Gbps)   40B pkts (Mpps)   Lookup time (ns)   84B pkts (Mpps)   354B pkts (Mpps)
OC3     0.155              0.48              2083               0.23              0.054
OC12    0.622              1.94              515                0.92              0.22
OC48    2.5                7.81              128                3.72              0.88
OC192   10                 31.25             32                 14.88             3.53
OC768   40                 125               8                  59.52             14.12

Hardware: first-generation IP routers
- DRAM access times: ~50 nsec; pricing (year 2012 retail): 4 GB for ~$32, 2 GB for ~$20; in [Gupta 98], 16 MB for ~$50 was the 1998 price
- SRAM access times: 5 to 10 nsec; pricing: SRAM is 4 to 5 times more expensive than DRAM
- [Figure: first-generation architecture -- a CPU and shared buffer memory on one bus, with DMA-capable line cards (MAC) attached to the same bus]
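Putting the line-rate table together with the memory access times just quoted, a rough, illustrative calculation (my own, assuming 50-ns DRAM, 5-ns SRAM, and back-to-back 40-byte packets) shows how many memory accesses a single lookup can afford at each line rate:

```python
# How many 50-ns DRAM (or 5-ns SRAM) accesses fit into one lookup budget?
def accesses_per_lookup(line_rate_gbps, pkt_bytes, access_ns):
    pps = line_rate_gbps * 1e9 / (pkt_bytes * 8)   # packets per second at this rate
    budget_ns = 1e9 / pps                          # time available per packet
    return budget_ns / access_ns

for name, rate in [("OC3", 0.155), ("OC48", 2.5), ("OC192", 10), ("OC768", 40)]:
    dram = accesses_per_lookup(rate, 40, 50)
    sram = accesses_per_lookup(rate, 40, 5)
    print(f"{name}: {dram:6.1f} DRAM or {sram:6.1f} SRAM accesses per 40B packet")
```

At OC192 and above, less than one DRAM access fits into the per-packet budget, which is why the naive shared-memory designs below stop scaling.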
First-generation IP routers
- Shared memory; the bus is the bottleneck, and memory read/write speed is also a bottleneck
- Every packet needs two transfers between the line cards and memory, i.e., it crosses the bus twice
- Route table stored in DRAM; does not scale to many line cards
- Suffices only for low-speed routers (< 1 Gbps)

Second-generation IP routers
- Each line card has a route-table cache
- On a hit, forward directly from card to card: the fast path
- On a miss, go via the CPU, bus, and memory: the slow path, which also updates the cache (a small sketch of this split appears at the end of this page)
- Buffer packets on the cards; suffices up to roughly 5 Gbps
- [Figure: CPU and buffer memory on a shared bus, each line card holding its own route cache; fast path directly between line cards, slow path through the CPU]

Third-generation IP routers
- A switching (switched backplane) interface replaces the shared bus
- Copy only the header to the forwarding engine, then reconstruct the packet on the outbound link
- [Figure: line cards (MAC + buffer) connected through transfer/forwarding units (TSU/FSU) and forwarding engines across a switch fabric]
- McKeown 97, Partridge 98
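Returning to the second-generation fast-path/slow-path split, here is a rough illustration (the names and the 4-port stub are hypothetical, not taken from any real router) of how a per-line-card route cache behaves:

```python
# Second-generation idea: per-line-card route cache with a CPU slow path.
route_cache = {}                  # destination address -> output port (on the line card)

def slow_path_lookup(dst):
    # Stand-in for the CPU consulting the full routing table in shared memory.
    return hash(dst) % 4          # pretend there are 4 output ports

def forward(dst):
    port = route_cache.get(dst)
    if port is None:              # cache miss: cross the bus to the CPU (slow path)
        port = slow_path_lookup(dst)
        route_cache[dst] = port   # cache update, so later packets take the fast path
    return port                   # cache hit: forward directly card-to-card (fast path)

forward("128.6.4.2")              # first packet to this destination: slow path
forward("128.6.4.2")              # subsequent packets to it: fast path
```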
Third-generation IP routers
- Multiple forwarding engines (FEs); the IP header is stripped off and given to an FE
- The header is processed separately from the body; the FE determines the outbound header
- The packet is reconstructed and moved from the source buffer to the destination buffer
- Exploit parallelism; have a separate data-transfer path

Three types of switching fabrics

Crossbar arbitration
- Virtual queues; FIFO at the input; each input requests any of the N outputs
- What about fairness? Rotating priority: top priority is given to the line following the line that was last serviced
- Head-of-line (HOL) blocking occurs with a simple FIFO
- Complex arbitration: N^2 possible input requests for M outputs; lots of arbiter schemes decide which of the requests from the input queues wins
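A minimal sketch of the rotating-priority (round-robin) rule described above, for a single crossbar output; this is my own illustration, not a specific arbiter from the slides:

```python
# Round-robin arbiter: top priority goes to the input after the one last served.
class RoundRobinArbiter:
    def __init__(self, n_inputs):
        self.n = n_inputs
        self.last = self.n - 1                 # so input 0 has top priority initially

    def grant(self, requests):
        """requests: one bool per input line. Returns the granted input, or None."""
        for offset in range(1, self.n + 1):
            i = (self.last + offset) % self.n  # scan starting just after the last winner
            if requests[i]:
                self.last = i                  # rotate priority past the winner
                return i
        return None

arb = RoundRobinArbiter(4)
print(arb.grant([True, False, True, False]))   # -> 0
print(arb.grant([True, False, True, False]))   # -> 2 (input 0 was just served)
```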
High-speed routers
- Juniper Networks M-series routers
- Two multi-gigabit routers:
  1. "A 50-Gbps IP router" by Craig Partridge et al., ACM/IEEE Trans. on Networking, 1998 -- specialized hardware, a 3rd-generation IP router, way ahead of its time
  2. "A 40-Gbps IP router" (PacketShader) by Sangjin Han et al., SIGCOMM 2010 -- a software router on commodity hardware (CPU + GPU)
- What has changed in those 13 years?

MultiGigabit Router (MGR)
- Separate switching backplane fabric; distributed architecture
- Multiple forwarding engines (FEs); each FE has its own forwarding table and buffer
- Routing and forwarding separation: the FE determines which outbound line the packet goes to, based on the header
- Only the header moves between the line card and the FE; packet construction and deconstruction are done by the line cards
- The line card uses the switching fabric to forward the packet; data is transferred from the input to the output line card via the switching backplane
- Three processor roles: Network Processor, FE Processor, Packet Processor
- Two memory banks are used to handle route updates
- Routing table handled by the NP: it maintains several routes to each destination D and determines the active route based on policy
- Forwarding table handled by the FE processor: it maintains only the active route to D; active routes are installed for each destination
- FE forwarding-table memory: 16 MB, divided into two 8-MB banks (used in active/standby mode)
Packet processing in the MGR
- Line card: the packet is buffered in a FIFO queue; the header is removed and passed to the FE
- FE: read the header, do the lookup, write the modified header; the modified header, with forwarding instructions, is sent back to the line card
- Line card: buffer the entire packet for delivery to the output line

Fast-path code
- Header check, lookup, TTL update, header update: 42 cycles (each cycle ~2.4 nanosec on the 415-MHz Alpha processor)
- Fast-path time: ~101 nsec, i.e., a packet forwarding rate of 9.8 Mpps (a quick sanity check of these numbers appears below)

Slow path
- Cache misses, header errors, headers with IP options, fragments, multicast

MGR features
- An FE is associated with each line card (but physically separate); it has its own 415-MHz Alpha processor and memory
- FEs keep the entire forwarding table, as opposed to a cache
- Switched backplane as opposed to a shared bus; switch transfer cycle ~0.38 microsec; at most 15 simultaneous transfers of 1 Kbit each (approx 50 Gbps)
- Network Processor: a 233-MHz Alpha, used for writing route updates into the forwarding tables in the FEs -- a control-plane function
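To sanity-check the fast-path numbers above, a small illustrative calculation (assuming a 415-MHz clock and 42 cycles per header, as quoted):

```python
clock_hz = 415e6                                # FE Alpha clock (assumed from the slide)
cycles_fast_path = 42                           # cycles spent on the fast path per header
cycle_ns = 1e9 / clock_hz                       # ~2.41 ns per cycle
fast_path_ns = cycles_fast_path * cycle_ns      # ~101 ns per header
mpps = 1e3 / fast_path_ns                       # ~9.9 million headers per second
print(f"{cycle_ns:.2f} ns/cycle, {fast_path_ns:.1f} ns/header, {mpps:.1f} Mpps")
```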
Switch arbitration
- Crossbar with FIFO inputs: each head-of-line packet contends for the fabric
- For fairness, the priority of the input lines is rotated
- With multiple input buffers: row-and-column arbiter, diagonal arbiter, wrapped diagonal arbiter
- Wave-front arbiter: on an N x N crossbar, grant requests along a diagonal; priority is given to higher-level diagonals, and the priority of the diagonals is rotated

Forwarding speeds
- Lots of hardware: specialized H/W, a network processor, a switching fabric, and smart forwarding (copy only the header)
- 3rd-generation designs can match line-card speeds

MGR router [Partridge 1998]: some observations (1998 hardware vs. 2010 hardware)
- FE processor: 415-MHz Alpha (1998) vs. 2.66-GHz Intel Xeon X5550 (2010)
- Network processor: 233-MHz Alpha (1998)
- Level-2 cache (on-chip): 96 KB vs. 256 KB or 512 KB
- L1 cache (I-cache / D-cache): 8 KB / 8 KB vs. 64 KB / 64 KB
- Registers: 32 vs. 16 or 32(?)
- Switch: 15-port; single core then, dual- to multi-core now
What's happening now
- Then: specialized hardware -- the MGR (1997)
- Now: general-purpose hardware -- PacketShader (SIGCOMM 2010), a GPU-based software high-speed router
- Software Defined Networking (SDN): a trend towards programmability on commodity H/W
- Parallel computing is available in GPUs; shader programs manipulate pixel values for scenes

Silicon budget in CPU and GPU
- Xeon X5550: 4 cores, 731M transistors; GTX480: 480 cores, 3,200M transistors -- a far larger share of the GPU die is ALUs

Per-packet CPU cycles for 10G
- IPv4: 1,200 (packet I/O) + 600 (IPv4 lookup) = 1,800 cycles
- IPv6: 1,200 (packet I/O) + 1,600 (IPv6 lookup) = 2,800 cycles
- IPsec: 1,200 (packet I/O) + 5,400 (encryption and hashing) = 6,600 cycles
- Available budget: ~1,400 cycles (10G, minimum-sized packets, dual quad-core 2.66-GHz CPUs)

Our approach: GPU offloading for memory-intensive or compute-intensive operations
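The ~1,400-cycle budget above can be reproduced with a rough back-of-the-envelope calculation; the 20-byte per-packet Ethernet framing overhead (preamble plus inter-frame gap) and the 8-core count are my assumptions for illustration:

```python
line_rate_bps = 10e9
wire_bytes = 64 + 20                      # minimum frame plus preamble/IFG per packet
pps = line_rate_bps / (wire_bytes * 8)    # ~14.88 Mpps on 10GbE
cores, clock_hz = 8, 2.66e9               # dual quad-core 2.66-GHz CPUs
budget = cores * clock_hz / pps           # ~1,430 cycles available per packet
print(f"{pps/1e6:.2f} Mpps -> {budget:.0f} cycles per packet")
```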
PacketShader [SIGCOMM 2010]
- A 40-Gbps router built from commodity PCs
- Exploits the parallelism of GPUs by doing lookups on a batch of packets: e.g., a Xeon CPU has 4 cores, while a GTX480 GPU has 480 cores
- Basic routing operations are offloaded to the GPU
- Rendering a scene is a parallel operation (per-pixel work) -- how is packet routing a parallel workload?

Key idea
- Process packets in batches from a large input buffer; each packet is handled by a separate core (sketched at the end of this page)
- Avoid H/W bottlenecks by partitioning: run packet-processing operations on independent cores

Parallelism in packet processing -- the key insight: stateless packet processing is parallelizable
- From the RX queue: 1. batching, 2. parallel processing in the GPU

Scaling with a multi-core CPU
- [Figure: a master core runs the shader (the GPU-facing stage), while worker cores run the pre-shader and post-shader stages on top of the device driver]

Results (with 64B packets), throughput in Gbps, CPU-only vs. CPU+GPU
- IPv4: 28.2 vs. 39.2 (1.4x GPU speedup)
- IPv6: 8 vs. 38.2 (4.8x)
- OpenFlow: 15.6 vs. 32 (2.1x)
- IPsec: 3.2 CPU-only, with a 3.5x GPU speedup
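Returning to the batching idea, here is a minimal sketch (hypothetical names, not PacketShader code): drain the RX queue into batches and fan each batch out across cores, the way PacketShader fans batches out across GPU threads.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 64

def route_lookup(pkt):
    # Placeholder per-packet work (e.g., a longest-prefix match on pkt["dst"]).
    return pkt["dst"] % 4                       # pretend there are 4 output ports

def process_batch(batch, pool):
    # One independent lookup per packet: stateless, hence trivially parallel.
    return list(pool.map(route_lookup, batch))

rx_queue = [{"dst": d} for d in range(256)]     # fake received packets
with ThreadPoolExecutor() as pool:
    for i in range(0, len(rx_queue), BATCH_SIZE):
        ports = process_batch(rx_queue[i:i + BATCH_SIZE], pool)
```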
Take away
- Packet processing in routers is a parallel workload! Batch packets.
- The same idea can be applied to other CPU-intensive networking operations, e.g., connection processing and HTTPS (SSLShader, NSDI 2011)

Computer Networks CS 552: 1. Route lookups
Badri Nath, Rutgers University, badri@cs.rutgers.edu

Why high-speed lookups? Classless addressing
- IPv4: 2^32 possible addresses (~4 x 10^9); IPv6: 2^128 possible addresses (~256 x 10^36)
- Naive lookup: keep a table entry for every IP address (IP address -> output port)
  - IPv4 would require 4G entries; memory cost at today's DRAM prices (~$6/GB) is roughly $24
  - Speed: ~50 ns for DRAM, ~5 ns for SRAM
  - But routes are advertised as prefixes, so every prefix would have to be unwound into individual entries, making table updates expensive [Gupta 98]
- [Figure: the IPv4 address line from 0.0.0.0 to 255.255.255.255, divided first into class-based blocks and then into classless (CIDR) prefixes such as /8, /16, and /23 ranges]
Number of active BGP prefixes
- [Figure: growth in the number of active BGP prefixes, and the prefix-length distribution of the MAE-EAST routing table (source: www.merit.edu)]

Prefixes and speed -- Routing Lookups in Hardware [Gupta 98]
- The routing table contains prefixes; the size of the table is proportional to the number of prefixes
- Is it small? No: prefixes keep increasing, so the routing table keeps growing

Lookup algorithms
- Software-based approaches: trie-based algorithms; binary search on tries / prefix lengths
- Hardware-based approaches: route-lookup memory; content-addressable memory

Size of the routing table
- [Figure: growth of the BGP routing table over time (source: http://www.telstra.net/ops/bgptable.html)]

Longest prefix match
- With CIDR, route entries are prefixes <prefix, CIDR mask> and can be aggregated
- We need to find the longest prefix that matches the destination address (see the sketch below)
- This means searching prefixes of all lengths, in order, and among the prefixes of the same length
- [Figure: the address line with a /16 prefix and a more specific /24 prefix nested inside it; an address falling in the /24 range must match the /24 entry, not the covering /16]
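A minimal longest-prefix-match sketch using Python's ipaddress module (the three-entry table is hypothetical): it simply scans every prefix and keeps the most specific match, which is exactly the semantics the faster structures below implement.

```python
import ipaddress

TABLE = {                      # hypothetical forwarding table: prefix -> output port
    "128.8.0.0/16": "port 1",
    "128.8.2.0/24": "port 2",
    "0.0.0.0/0":    "default",
}

def longest_prefix_match(dst):
    addr = ipaddress.ip_address(dst)
    best_len, best_hop = -1, None
    for prefix, hop in TABLE.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:   # keep the most specific match
            best_len, best_hop = net.prefixlen, hop
    return best_hop

print(longest_prefix_match("128.8.2.77"))   # -> "port 2": the /24 beats the covering /16
print(longest_prefix_match("128.8.9.1"))    # -> "port 1"
```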
Linear search
- Keep the N prefixes in a linked list: O(N) storage, O(N) lookup time, O(1) update complexity if new entries are added at the head of the list; arbitrary insertion and deletion is O(N)
- Keeping the list sorted by prefix length improves the average time for operations

Tree search
- Simple binary search tree: each left subtree has key values <= the root, each right subtree has key values >= the root; every step makes a full key comparison
- Digital search tree: branch according to selected bits of the key -- left branch for bit value 0, right branch for bit value 1; at level i, check the i-th most significant bit
- [Example: a binary search tree vs. a digital search tree built over the keys A, C, E, R, S]

Trie
- Trie node: next-hop-ptr (if a prefix ends here), left-ptr, right-ptr
- Like a digital search tree, but only leaves store data; leaves are ordered left to right; a leaf node holds the next-hop information (if a prefix is found)
- Lookup walks down the trie, comparing one bit of the key per step
- For fixed-length prefixes: O(W) lookup, where W is the prefix length (the height of the trie); storage is O(N) for leaves plus O(N) for internal nodes, where N is the number of prefixes

Radix trie
- Stores variable-length prefixes (keys), using internal nodes as well: a prefix is the concatenation of all the bits along the path from the root
- Compare bit i at level i; during lookup, keep track of the longest prefix seen so far (see the trie sketch below)
- [Example: a binary trie storing prefixes P1-P4 with next hops L1-L4]
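A minimal unibit (binary) trie sketch for longest-prefix match, with prefixes written as bit strings; the prefixes and next hops below are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.child = {"0": None, "1": None}
        self.next_hop = None                  # set only if a prefix ends at this node

def insert(root, prefix_bits, next_hop):
    node = root
    for b in prefix_bits:                     # one trie level per prefix bit
        if node.child[b] is None:
            node.child[b] = TrieNode()
        node = node.child[b]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    node, best = root, None
    for b in addr_bits:                       # follow the address bits
        if node.next_hop is not None:
            best = node.next_hop              # remember the best (longest) prefix so far
        node = node.child[b]
        if node is None:
            return best                       # ran out of branches: best match wins
    return node.next_hop or best

root = TrieNode()
insert(root, "1",   "L1")                     # hypothetical prefixes P1..P3
insert(root, "10",  "L2")
insert(root, "111", "L3")
print(lookup(root, "10110000"))               # -> "L2"
print(lookup(root, "11011111"))               # -> "L1"
```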
Trie search
- At each level, go to the left or right subtree based on the next bit of the address
- On visiting a node that holds a prefix P, remember it as the best match so far (BMP = P)
- The search ends when there are no more branches; the longest prefix match is the last BMP recorded
- For N prefixes of W bits each: O(W) lookup, O(NW) storage, and O(W) update complexity

Radix trie
- Storage is wasted on long one-child chains
- Idea: compress branches that have only one child -- the Patricia tree

Patricia trie
- Internal node: bit position to test, left-ptr, right-ptr, parent-ptr
- Leaf node: prefix/CIDR mask, parent-ptr, next hop
- Because skipped bits are not compared on the way down, a lookup may reach the wrong leaf (e.g., end at P4) and must backtrack to find the actual longest match (e.g., P3)
- PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric
- [Example: a Patricia trie over prefixes P1-P4 (next hops L1-L4), with internal nodes labelled by the bit positions they test]
Pat tree features
- A Pat(ricia) tree is a complete binary tree: every node has degree 0 or 2
- For W-bit prefixes: worst-case O(W^2) lookup (because of backtracking), O(W) update complexity
- N leaves and N-1 internal nodes, so less storage; the backtracking cost can be improved

Multi-bit tries
- Binary trie: depth W, degree 2, stride 1 bit
- Multi-ary trie: depth W/k, degree 2^k, stride k bits

Prefix expansion with multi-bit tries
- If the stride is k bits, prefixes whose lengths are not a multiple of k must be expanded; e.g., with k = 2, the length-1 prefix 1* is expanded into 10* and 11* (see the sketch below)
- The maximum number of expanded prefixes corresponding to one original prefix is 2^(k-1)

Four-ary trie (k = 2)
- A four-ary trie node holds a next-hop-ptr (if a prefix ends here) and four child pointers (for 00, 01, 10, 11)
- [Example: a four-ary trie storing prefixes P1-P4 with next hops H1-H4, and a lookup walking two bits at a time]
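A small sketch of the expansion step (my own helper, not from the slides): a prefix whose length is not a multiple of the stride k is replaced by all the prefixes of the next multiple-of-k length that it covers; when an expanded prefix collides with an existing longer prefix, the longer original keeps its next hop.

```python
def expand(prefix_bits, k):
    """Expand a bit-string prefix to the next multiple-of-k length."""
    pad = (-len(prefix_bits)) % k             # extra bits needed to reach a stride boundary
    if pad == 0:
        return [prefix_bits]
    return [prefix_bits + format(i, f"0{pad}b") for i in range(2 ** pad)]

print(expand("1", 2))     # ['10', '11']     : 1*   -> 10*, 11*
print(expand("101", 2))   # ['1010', '1011'] : 101* -> 1010*, 1011*
print(expand("10", 2))    # ['10']           : already on a stride boundary
```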
Luleå algorithm: motivation
- Degermark et al., "Small Forwarding Tables for Fast Routing Lookups," Proc. ACM SIGCOMM '97
- Large routing tables: Patricia (NetBSD) and radix (4.4 BSD) trees use ~24 bytes per leaf, giving tables of a couple of megabytes
- Routing tables have grown from roughly 40,000 entries in the late 1990s to hundreds of thousands of entries
- A naive binary tree is huge and won't fit in fast CPU cache memory
- Memory accesses are the bottleneck of a lookup; goal: minimize memory accesses and the size of the data structure
- Designed for up to 2^14 (~16K) distinct next hops
- Method: compress the radix tree using bit vectors

Luleå algorithm
- CIDR longest-prefix-match rule: a more specific entry e2 supersedes a covering entry e1
- Divide the complete binary tree over the IP address space (2^32 possible addresses) into three levels: Level 1 is one big node covering the tree down to depth 16 bits; Levels 2 and 3 are "chunks" describing portions of the tree below depths 16 and 24
- The binary tree is sparse, and most accesses are resolved in Level 1 and/or Level 2
Luleå algorithm: Level 1
- Covers all prefixes of length <= 16: cut across the tree at depth 16, giving a bit vector of length 2^16
- In the bit vector: a root head = 1, a genuine head = 1, and a member of a genuine head = 0
- Divide the bit vector into 2^12 bit masks of 16 bits each
- Head information is stored in a pointer array: one 16-bit pointer per set bit (= 1) in the bit masks; each pointer has 2 bits of type info and 14 bits of index
  - Genuine heads: the pointer indexes the next-hop table
  - Root heads: the pointer indexes the array of Level 2 (L2) chunks
- Problem: given an IP address, find the index of its pointer in the pointer array

Luleå: finding the pointer group
- Pointers are grouped by 16-bit bit mask; the question is how many pointers to skip
- Code word array `code` (2^12 entries): one entry per 16-bit bit mask, indexed by the top 12 bits of the IP address; its 6-bit offset `six` counts the pointers to skip, within the current group of four masks, to reach the first pointer of this bit mask
- A group of four bit masks can have up to 4 x 16 = 64 set bits, more than a 6-bit field can count across the whole array, so the offsets are kept relative to a base index
- Base index array `base` (2^10 entries): one base index per four code words, giving the number of pointers that precede the group; indexed by the top 10 bits of the IP address

Luleå: finding the pointer within the group
- Let a(n) be the number of possible bit masks describing a complete subtree of height n: a(0) = 1, a(n) = 1 + a(n-1)^2, and a(4) + 1 = 678 possible 16-bit masks
- So `maptable` can be indexed with 10 bits: the `ten` field of a code word selects a row of `maptable`
- `maptable` entries are 4-bit offsets; the table is precomputed and constant -- for each possible bit-mask pattern, the value in each cell is fixed; only the IP address varies with the tree
- Putting it together, with bix = the top 10 bits, ix = the top 12 bits, and bit = the next 4 bits of the address:
  pix := base[bix] + code[ix].six + maptable[code[ix].ten][bit]
  and ptr[pix] is the pointer for this address (see the sketch below)
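The pix computation is easier to see with a simplified sketch (my own illustration): in the end, base[], code[].six and maptable[][] add up to a popcount of the Level-1 bit vector up to the address's position, computed in a constant number of memory accesses. The brute-force version below makes that correspondence explicit.

```python
def build_bitvec(head_positions):
    """head_positions: set of 16-bit positions that are root or genuine heads (bit = 1)."""
    bitvec = [0] * (1 << 16)
    for pos in head_positions:
        bitvec[pos] = 1
    return bitvec

def pointer_index(addr, bitvec):
    pos = addr >> 16                   # cut the tree at depth 16
    # The covering head is the nearest set bit at or before pos; its pointer sits at
    # index (number of set bits up to and including it) - 1 in the pointer array.
    return sum(bitvec[: pos + 1]) - 1  # == base[bix] + code[ix].six + maptable[ten][bit]

bitvec = build_bitvec({0x0000, 0x8000, 0x8100})   # three hypothetical heads
print(pointer_index(0x81234567, bitvec))          # -> 2: third pointer in the array
```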
Luleå algorithm: Levels 2 and 3
- Consist of chunks, pointed to by root heads; a chunk covers a subtree of height 8, i.e., up to 256 heads
- Three types of chunk:
  - Sparse: 1-8 heads; an array of the 8-bit indices of the heads
  - Dense: 9-64 heads; like Level 1, but with only one base index
  - Very dense: 65-256 heads; same format as Level 1

Luleå: summary
- Trades mutability and table-construction time for lookup speed: adding a routing entry requires rebuilding the entire table (routing tables don't change that often)
- Bottom line: a lookup takes at most 8 memory references touching ~14 bytes; the table is ~150 Kbytes for 40,000 entries, i.e., 4-5 bytes per entry
- State of the art (at the time) in router IP lookup; open issue: scaling to IPv6 (128-bit addresses)

Hash tables [Waldvogel 98]: binary search on trie levels / prefix lengths
- Store prefixes of different lengths in separate tables; chain prefixes of the same length together; the array of tables is O(number of distinct prefix lengths)
- Naive search: start from the longest length -- extract that many bits of the address and try a match; if it matches, return the next hop, else drop to the next shorter length and repeat
- Better: define a recursive (binary) search order over the prefix lengths
  - Start at the middle prefix length; on a match, continue among the longer prefix lengths; on a miss, continue among the shorter ones
  - Add markers to guide the search: a marker is a sub-prefix of some longer prefix, inserted into a shorter-length table so the search knows a longer match may exist
- At most log2(W) hash lookups; scales to IPv6 (see the sketch below)
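A minimal sketch of binary search on prefix lengths with markers (Waldvogel-style), using bit-string prefixes and Python dicts in place of the per-length hash tables; as in the paper, every entry carries the best matching prefix up to its own length so the search never has to backtrack. The names and the example table are hypothetical.

```python
def build_tables(prefixes, lengths):
    """prefixes: dict bit-string -> next hop; lengths: sorted list of prefix lengths."""
    def bmp(bits):                      # best (longest) real prefix of `bits`, if any
        for l in range(len(bits), 0, -1):
            if bits[:l] in prefixes:
                return prefixes[bits[:l]]
        return None
    tables = {l: {} for l in lengths}
    for p in prefixes:
        for l in lengths:
            if l <= len(p):             # real entry at l == len(p), marker at shorter l
                tables[l][p[:l]] = bmp(p[:l])
    return tables

def lookup(addr_bits, tables, lengths):
    lo, hi, best = 0, len(lengths) - 1, None
    while lo <= hi:                     # binary search over the prefix lengths
        mid = (lo + hi) // 2
        entry = tables[lengths[mid]].get(addr_bits[:lengths[mid]], "miss")
        if entry == "miss":
            hi = mid - 1                # nothing here: only shorter prefixes can match
        else:
            if entry is not None:
                best = entry            # entry records the best match up to this length
            lo = mid + 1                # marker or prefix: a longer match may exist
    return best

prefixes = {"1": "A", "110": "B"}       # hypothetical table
tables = build_tables(prefixes, [1, 2, 3])
print(lookup("111", tables, [1, 2, 3])) # -> "A" (the marker at length 2 carries A)
print(lookup("110", tables, [1, 2, 3])) # -> "B"
```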
Route lookups in hardware [Gupta 98, Infocom]
- Idea: store (nearly) all prefixes directly in memory / high-speed cache
- Full IPv4 would need 4G entries, but most route entries are prefixes of length <= 24, so store all 24-bit prefixes: 2^24 = 16M entries
- DRAM has become far cheaper than the 1998 prices quoted in [Gupta 98]; today a gigabyte costs only a few dollars
- Store the 24-bit prefixes with their next-hop information in one table; for longer prefixes, use a secondary table -- the same idea as a two-level page table
- Most lookups take a single memory access of ~50 nsec
- [Figure: the first 24 bits of the destination address index a 2^24-entry table; an entry holds either the next hop directly or a base pointer into a second table, which is indexed by that base plus the remaining 8 bits of the address]

Routing lookups in H/W: assessment
- Memory is cheap; lookup times of tens of nanoseconds are achievable, and the technique can be refined to fit in SRAM
- Effectiveness depends on the prefix-length distribution
- Update complexity: use two memory banks and switch between them after each update, since a single prefix change can touch many entries
  - Either update every covered entry, or update ranges and tag each entry with its prefix length
  - Deleting one prefix can mean deleting many entries: a /16 covers 256 entries, a /8 covers 64K entries
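A small software sketch of the two-table scheme described above (often referred to as DIR-24-8); a Python dict stands in for the dense 2^24-entry hardware table, and the table contents are hypothetical.

```python
TBL24 = {}          # index = top 24 bits of the address -> next hop, or ("long", base)
TBLLONG = []        # 256-entry blocks for prefixes longer than /24

def add_prefix(prefix_int, plen, next_hop):
    if plen <= 24:
        start = prefix_int << (24 - plen)
        for i in range(start, start + (1 << (24 - plen))):
            TBL24[i] = next_hop                     # replicate over every covered /24
    else:
        idx = prefix_int >> (plen - 24)             # the /24 that owns this long prefix
        if not isinstance(TBL24.get(idx), tuple):
            base = len(TBLLONG)
            TBLLONG.extend([TBL24.get(idx)] * 256)  # inherit the covering shorter match
            TBL24[idx] = ("long", base)
        base = TBL24[idx][1]
        start = (prefix_int << (32 - plen)) & 0xFF
        for i in range(start, start + (1 << (32 - plen))):
            TBLLONG[base + i] = next_hop

def lookup(addr):
    entry = TBL24.get(addr >> 8)                    # first (usually the only) memory access
    if isinstance(entry, tuple):                    # second access only for />24 prefixes
        return TBLLONG[entry[1] + (addr & 0xFF)]
    return entry

add_prefix(0xC0A8, 16, "port 1")                    # 192.168.0.0/16 -> port 1
print(lookup(0xC0A80707))                           # 192.168.7.7 -> "port 1"
```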
Content-addressable memory (CAM)
- Fully associative memory; a TCAM (ternary CAM) stores 0, 1, or * (don't care) per bit
- Match operation in a single clock cycle: all entries are compared in parallel
- The content (the destination address) is the key; the location at which the matching content is stored is returned and indexes the next-hop entry
- Far more expensive than DRAM (an order of magnitude or more)
- A plain CAM is good for fixed-length data; variable-length prefixes need a TCAM
- Typical capacities of 2-4 MB and high power consumption (tens of watts)
- [Figure: the destination address is presented to the prefix CAM, which returns the matching location; that location indexes a prefix -> next-hop table]

Research and technology trends
- Multicore processors: energy, work scheduling, parallelization -- e.g., slow path on one core, fast path on another; DVFS
- Virtualization: run more than one OS (an RTOS plus Linux or BSD) -- the RTOS for the fast path, Linux for the slow path
- Solid-state drives, low-energy memory, SDR
- More or less work for routers? More updates, more communication

Papers
- [NSDI 13] Wire-speed name lookup: a GPU-based approach. https://www.usenix.org/system/files/conference/nsdi3/nsdi3-final32.pdf
- [NSDI 11] SSLShader. http://www.ndsl.kaist.edu/~kyoungsoo/papers/sslshader.pdf
- [CoNEXT] Multilayer packet classification using GPU. http://winlab.rutgers.edu/~feixiong/docs/conext24.pdf