FPX Architecture for a Dynamically Extensible Router
Alex Chandra, Yuhua Chen, John Lockwood, Sarang Dharmapurikar, Wenjing Tang, David Taylor, Jon Turner
http://www.arl.wustl.edu/arl
Dynamically Extensible Router
[Block diagram: a Control Processor and Line Cards attach to the Switch Fabric through Field Programmable Port Extenders (FPX), one per port. Each FPX pairs a Reprogrammable Application Device (with 128 MB SDRAM and 4 MB SRAM) with a Network Interface Device.]
2 - Jonathan Turner - 6/19/2002
Special Packet Processing
[Block diagram: the same router, with a Smart Port Card (Pentium with cache, North Bridge, 32-64 MB memory, system FPGA, and APIC network interface) inserted between an FPX and its line card for software packet processing.]
Logical Port Architecture
[Block diagram: input-side processing (reassembly contexts, Packet Classification & Route Lookup, plugins, active flow queues, virtual output queues) and output-side processing (reassembly contexts, Packet Classification, plugins, active flow queues, output queues), with the Distributed Queueing (DQ) controller between the two sides.]
RAD Block Diagram
[Block diagram: the data path from line card and switch enters through the ISAR into the Packet Storage Manager (which includes the free-space list and SDRAM); packet headers and pointers go to Classification and Route Lookup (off-chip SRAM) and the Queue Manager (with packet discard); packets leave through the OSAR toward the line card and switch. A Control Cell Processor handles route & filter updates, register set updates & status, and DQ status & rate control.]
Physical Configuration
[Block diagram: the physical placement of the same components: the NID between line card and switch; two Packet Storage Managers, each with its own SDRAM; Classification and Route Lookup with two SRAMs; the Queue Manager with discard path and DQ status & rate control; the ISAR and OSAR; and the Control Cell Processor.]
Classification and Route Lookup (CARL)
! Three lookup engines
» route lookup for routing datagrams - best prefix match
» flow filters for multicast & reserved flows - exact match
» general filters (32) for management - exhaustive search
! Input processing
» parallel check of all three engines
» return highest priority exclusive and highest priority non-exclusive match
» general filters have unique priorities
» all flow filters share a single priority; ditto for routes
! Output processing
» exact match only
! Route lookup & flow filters share off-chip SRAM
! General filters processed on-chip
[Block diagram: an input demux feeds headers to the Route Lookup, Flow Filters, and General Filters engines in parallel (with a bypass path); a result processing & priority resolution stage merges their answers.]
Exact Match Lookup
! Exact match lookup table used for reserved flows
» includes LFS, signaled QoS flows and multicast
» and flows requiring processing by plugins
» each of these flows has a separate queue in the QM
» multicast flows have two queues (recycling multicast)
» implemented using hashing
[Diagram: a simple hash on the packet's src/dst fields indexes on-chip SRAM, whose entries carry ingress-valid and egress-valid bits and point into off-chip SRAM holding tag+data records. The tag is [src, dst, sport, dport, proto]; the data includes 2 outputs + 2 QIDs, LFS rates, packet and byte counters, and flags. Separate memory areas serve ingress and egress packets.]
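The hashed exact-match table can be sketched in software as follows. This is a minimal model of the idea, not the FPX implementation: the class and field names are illustrative, Python's built-in `hash` stands in for the on-chip hash function, and buckets replace the SRAM tag+data layout. The key point it shows is that the full 5-tuple tag is stored and re-compared, so hash collisions cannot produce false matches.

```python
# Sketch of an exact-match flow table keyed on the 5-tuple (assumed names).
from dataclasses import dataclass

@dataclass
class FlowEntry:
    tag: tuple      # (src, dst, sport, dport, proto)
    outputs: tuple  # up to 2 outputs, per the slide
    qids: tuple     # up to 2 queue IDs

class FlowTable:
    def __init__(self, nbuckets=1024):
        self.nbuckets = nbuckets
        self.buckets = [[] for _ in range(nbuckets)]

    def _hash(self, tag):
        # stand-in for the simple on-chip hash of the packet fields
        return hash(tag) % self.nbuckets

    def insert(self, tag, outputs, qids):
        self.buckets[self._hash(tag)].append(FlowEntry(tag, outputs, qids))

    def lookup(self, tag):
        # the stored tag is compared exactly, so a colliding hash bucket
        # never returns the wrong flow's entry
        for e in self.buckets[self._hash(tag)]:
            if e.tag == tag:
                return e
        return None
```

In the hardware, separate ingress and egress memory areas would correspond to two such tables.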
General Filter Match
! General filter match considers the full 5-tuple
» prefix match on source and destination addresses
» range match on source and destination ports
» exact or wildcard match on protocol
» each filter has a priority and may be exclusive or non-exclusive
! Intended primarily for management filters
» firewall filters
» class-based monitoring
» class-based special processing
! Implemented using parallel exhaustive search
» limit of 32 filters
[Diagram: a shared filter memory feeds a bank of parallel matchers.]
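The matching rule above can be sketched as follows. In the FPGA the 32 matchers compare in parallel; this software model simply iterates, which is equivalent for a small fixed filter count. Field names and encodings (prefixes as value/length pairs on 32-bit addresses, `None` as the protocol wildcard) are assumptions for illustration.

```python
# Sketch of 5-tuple general filter matching with priority resolution.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Filter:
    src_prefix: Tuple[int, int]   # (value, prefix_len) on a 32-bit address
    dst_prefix: Tuple[int, int]
    sport_range: Tuple[int, int]  # inclusive (lo, hi)
    dport_range: Tuple[int, int]
    proto: Optional[int]          # None means wildcard
    priority: int
    exclusive: bool

def prefix_match(addr, prefix):
    value, plen = prefix
    shift = 32 - plen
    return (addr >> shift) == (value >> shift)

def match_filters(filters, src, dst, sport, dport, proto):
    """Return the highest-priority exclusive and non-exclusive matches,
    mirroring CARL's priority resolution stage."""
    best_ex = best_nx = None
    for f in filters:   # hardware checks all matchers in parallel
        if (prefix_match(src, f.src_prefix) and prefix_match(dst, f.dst_prefix)
                and f.sport_range[0] <= sport <= f.sport_range[1]
                and f.dport_range[0] <= dport <= f.dport_range[1]
                and (f.proto is None or f.proto == proto)):
            if f.exclusive:
                if best_ex is None or f.priority > best_ex.priority:
                    best_ex = f
            elif best_nx is None or f.priority > best_nx.priority:
                best_nx = f
    return best_ex, best_nx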
Fast IP Lookup (Eatherton & Dittia)
[Figure: a multibit trie example for the lookup of address 101 100 101 000. Each trie node is encoded as an internal bit vector, marking which prefixes end inside the node, and an external bit vector, marking which child nodes exist.]
! Multibit trie with clever data encoding
» small memory requirements (4-6 bytes per prefix typical)
» small memory bandwidth and a simple lookup yield fast lookup rates
» updates have negligible impact on lookup performance
! Avoid impact of external memory latency on throughput by interleaving several concurrent lookups
» 8 lookup engine configuration uses about 6% of Virtex 2000E logic cells
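The internal/external bit-vector encoding can be illustrated with a toy software model. This sketch uses a stride of 2 and represents bitmaps as 0/1 lists with `sum()` standing in for hardware popcounts; the real FIPL engine packs each node's bitmaps into SRAM words, so the layout here is an assumption chosen for readability. The key trick shown is that a popcount over the bits preceding a set bit gives the index into a densely packed result or child array.

```python
# Toy model of the Eatherton/Dittia tree bitmap encoding, stride 2.
class Node:
    def __init__(self, internal, external, results=None, children=None):
        self.internal = internal        # 3 bits, heap order: [*, 0*, 1*]
        self.external = external        # 4 bits: one per child 00,01,10,11
        self.results = results or []    # next hops, one per set internal bit
        self.children = children or []  # one per set external bit

def lookup(root, bits):
    """Longest-prefix match; bits is the address as a 0/1 list (even length)."""
    node, best = root, None
    for i in range(0, len(bits), 2):
        v = bits[i] * 2 + bits[i + 1]
        # longest prefix of v stored inside this node: try length 1, then 0
        for heap_idx in (1 + bits[i], 0):
            if node.internal[heap_idx]:
                # popcount of earlier internal bits indexes the result array
                best = node.results[sum(node.internal[:heap_idx])]
                break
        if node.external[v]:
            # popcount of earlier external bits indexes the child array
            node = node.children[sum(node.external[:v])]
        else:
            break
    return best
```

For example, the prefixes `*` -> "default", `1*` -> "A", and `10` -> "B" encode as a root node (internal `[1,0,1]`, external `[0,0,1,0]`) plus one child whose internal `*` bit holds "B".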
Lookup Throughput & Latency
[Chart: millions of lookups per second and average lookup latency (ns) vs. number of FIPL engines (1-8), for worst-case and Mae West traces. Throughput grows linearly with the number of engines while latency increases only negligibly.]
Update Performance
[Chart: lookup throughput vs. number of FIPL engines (1-8) with no updates, 10K updates/sec, and 100K updates/sec (1 update every 10 µs). Reasonable update rates have little impact on lookup throughput.]
Queue Manager (QM) Logical View
[Diagram: arriving packets are placed into a separate queue set for each output (outputs 0 through 8). On the input side, each output's set holds a separate queue for each reserved flow plus a datagram queue; a per-output packet scheduler, paced by DQ, feeds the virtual output queues (VOQs) toward the switch. On the output side, each reserved flow again has its own queue, datagrams share 64 hashed datagram queues for traffic isolation, and a packet scheduler feeds the link.]
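The queue selection logic implied by the figure can be sketched as follows: reserved flows (installed via the exact-match table) get a dedicated queue ID, while datagram traffic hashes into a fixed pool of 64 queues so that no single flow can starve the rest. Class and method names, the QID numbering, and the use of Python's `hash` are all assumptions of this sketch, not the FPX design.

```python
# Sketch of per-output queue selection: reserved vs. hashed datagram queues.
class QueueManager:
    NDGRAM = 64  # hashed datagram queues per output, per the slide

    def __init__(self):
        self.flow_qid = {}            # (output, flow 5-tuple) -> reserved QID
        self.next_qid = self.NDGRAM   # reserved QIDs numbered above hash range

    def reserve(self, output, flow):
        """Give a reserved flow its own queue on the given output."""
        qid = self.next_qid
        self.next_qid += 1
        self.flow_qid[(output, flow)] = qid
        return qid

    def classify(self, output, flow):
        """Reserved flows map to their own queue; datagrams share the
        64 hashed queues, which bounds the damage any one flow can do."""
        if (output, flow) in self.flow_qid:
            return self.flow_qid[(output, flow)]
        return hash(flow) % self.NDGRAM
```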
Distributed Queueing
[Diagram: each port keeps a queue per output, and a scheduler at every input paces each queue according to its share of the backlog, guided by periodic queue length reports exchanged through the switch fabric and control processor.]
Basic Distributed Queueing Algorithm
! Goal: avoid switch congestion and output queue underflow.
! Let B(i,j) be the backlog at input i for output j, B(j) the backlog at output j, and B(+,j) the total input-side backlog for output j.
! Can avoid output-side switch congestion if rate(i,j) ≤ hi(i,j) = Lj S B(i,j)/B(+,j)
» where Lj is the external link rate at output j and S is the switch speedup
! Can avoid underflow at output j if rate(i,j) ≥ lo(i,j) = Lj B(i,j)/(B(j) + B(+,j))
» this can be achieved if lo(i,+) ≤ Li S for all i
! Can avoid input-side switch congestion if rate(i,j) ≤ hi′(i,j) = Li S lo(i,j)/lo(i,+)
! Let rate(i,j) = min{ hi(i,j), hi′(i,j) }.
! Algorithm avoids congestion and, for large enough S, avoids underflow.
» open question: what is the smallest value of S for which underflow cannot occur?
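The rate computation above can be transcribed directly. This sketch assumes `B` is an n-input by m-output matrix of input-side backlogs, `Bout[j]` the output-side backlog B(j), `L` the per-port link rates, and `S` the speedup; zero-backlog columns and rows are guarded to avoid division by zero, a detail the slide leaves implicit.

```python
# Direct transcription of the distributed queueing rate formulas.
def dq_rates(B, Bout, L, S):
    n, m = len(B), len(B[0])
    # B(+,j): total input-side backlog for each output j
    col = [sum(B[i][j] for i in range(n)) for j in range(m)]
    # lo(i,j) = L_j * B(i,j) / (B(j) + B(+,j)), the underflow-avoidance floor
    lo = [[L[j] * B[i][j] / (Bout[j] + col[j]) if Bout[j] + col[j] else 0.0
           for j in range(m)] for i in range(n)]
    rate = [[0.0] * m for _ in range(n)]
    for i in range(n):
        lo_row = sum(lo[i])  # lo(i,+)
        for j in range(m):
            # hi(i,j): output-side congestion bound
            hi = L[j] * S * B[i][j] / col[j] if col[j] else 0.0
            # hi'(i,j): input-side congestion bound
            hi2 = L[i] * S * lo[i][j] / lo_row if lo_row else 0.0
            rate[i][j] = min(hi, hi2)
    return rate
```

With uniform backlogs, e.g. B(i,j) = 1 for a 2x2 switch with empty output queues, unit link rates, and S = 2, each rate(i,j) comes out to 1.0 and each input's total stays within Li S, matching the bounds above.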
Stress Test
! Can vary the number of inputs and outputs used, and the length of the phases.
Stress Test Simulation - Min Rates
[Chart: cumulative minimum-rate sums lo(1,1), +lo(1,2), ..., +lo(1,5) vs. time (0-8,000) through the first and second phases at speedup 1.5; the stacked sums track the critical rate.]
Stress Test - Actual Rates
[Chart: cumulative allocated rate sums rate(1,1), +rate(1,2), ..., +rate(1,5) vs. time at speedup 1.5, showing under-use of input bandwidth relative to the critical rate during the first and second phases.]
Stress Test - Input Queue Lengths
[Chart: input queue lengths B(1,1) through B(1,5) vs. time at speedup 1.5; the input-side backlog for the final output implies underflow.]
Stress Test - Output Queue Lengths
[Chart: output queue lengths B(1) through B(5) vs. time at speedup 1.5; a persistent output-side backlog is caused by an earlier dip in the forwarding rate.]
Resource Usage Estimates
! Key resources in Xilinx FPGAs
» flip flops - 38,400
» lookup tables (LUTs) - 38,400
  - each can implement any 4-input Boolean function
» block RAMs (4 Kbits each) - 160

Module            flops    LUTs     RAMs   % flops  % LUTs   % RAMs
CARL              2,217    4,695    32     5.8%     12.2%    20.0%
CCP               1,156    1,612    3      3.0%     4.2%     1.9%
FIFOs             133      284      10     0.3%     0.7%     6.3%
ISAR              4,000    5,400    28     10.4%    14.1%    17.5%
OSAR              2,000    3,000    24     5.2%     7.8%     15.0%
PSM (both)        4,722    4,148    20     12.3%    10.8%    12.5%
QM (vers. 1, 2)   13,258   12,085   27     34.5%    31.5%    16.9%
Total             27,486   31,224   144    71.6%    81.3%    90.0%
Resource Count    38,400   38,400   160
% Usage           72%      81%      90%
Comparison of Available FPGAs on FPXs
[Chart: signal delay (ns) vs. number of LUTs in the datapath, for flip-flops placed in opposite corners vs. in adjacent cells/CLBs, comparing the XCV2000e-6 and XCV1000e-7 parts.]
Summary
! A single XCV2000 FPGA can do IP packet processing for a gigabit link.
» would be simple if it just did route lookup and FIFO queues
» using SDRAMs effectively is hard
  - significant overheads, dependent on the sequences of operations
» packet classification & general queueing add complexity
  - intelligent packet discarding greatly expands the required memory bandwidth
» achieving wire-speed operation under worst-case conditions is challenging
! Expect to complete the first version this fall.
! Complete version by the middle of 2003.