FPX Architecture for a Dynamically Extensible Router


Alex Chandra, Yuhua Chen, John Lockwood, Sarang Dharmapurikar, Wenjing Tang, David Taylor, Jon Turner
http://www.arl.wustl.edu/arl

Dynamically Extensible Router
[Diagram: a Control Processor and a Switch Fabric, with a Field Programmable Port Extender (FPX) and a Line Card at each port. FPX labels: Reprogrammable Application Device, SDRAM 128 MB, SRAM 4 MB, Network Interface Device.]
Jonathan Turner - 6/19/2002

Special Packet Processing
[Diagram: Control Processor, Switch Fabric, and an FPX and Line Card at each port, with a Smart Port Card attached. Smart Port Card labels: Pentium, Cache, North Bridge, 32-64 MB, Sys. FPGA, APIC.]

Logical Port Architecture
[Diagram, Input Side Processing: reassembly contexts, Packet Classification & Route Lookup, active flow queues, plugins, virtual output queues, DQ.]
[Diagram, Output Side Processing: reassembly contexts, Packet Classification, active flow queues, plugins, output queues.]

RAD Block Diagram
[Diagram blocks: ISAR (from LC, from SW), Packet Storage Manager (includes free space list) with SDRAM, OSAR (to LC, to SW), Classification and Route Lookup with SRAM, Queue Manager with SRAM, Control Cell Processor.]
[Diagram signals: Data Path, Header Pointer, Discard Pointer, Header Proc., Control, Register Set, Route & Filter Updates, Register Set Updates & Status, DQ Status & Rate Control.]

Physical Configuration
[Diagram blocks: NID (from LC, to LC, from SW, to SW), ISAR, OSAR, Packet Storage Manager 1 and Packet Storage Manager 2 (each with SDRAM), Classification and Route Lookup, Queue Manager, Control Cell Processor, two SRAMs.]
[Diagram signals: Route & Filter Updates, Register Set Updates & Status, Discard, DQ Status & Rate Control.]

Classification and Route Lookup (CARL)
• Three lookup engines.
  » route lookup for routing datagrams - best prefix
  » flow filters for multicast & reserved flows - exact
  » general filters (32) for management - exhaustive
• Input processing.
  » parallel check of all three
  » return highest priority exclusive and highest priority non-exclusive
  » general filters have unique priority
  » all flow filters share single priority
  » ditto for routes
• Output processing.
  » exact match only
• Route lookup & flow filters share off-chip SRAM
• General filters processed on-chip
[Diagram: headers enter an Input Demux feeding Route Lookup, Flow Filters, and General Filters (with a bypass path); results go to Result Proc. & Priority Resolution.]

Exact Match Lookup
• Exact match lookup table used for reserved flows.
  » includes LFS, signaled QOS flows and multicast
  » and flows requiring processing by SPCs
  » each of these flows has separate queue in QM
  » multicast flows have two queues (recycling multicast)
  » implemented using hashing
• tag = [src, dst, sport, dport, proto]
• data includes 2 outputs + 2 QIDs, LFS rates, packet and byte counters, flags
• separate memory areas for ingress and egress packets
[Diagram: packet header fields feed a simple hash that indexes on-chip SRAM holding ingress-valid and egress-valid bits; the tag+data entries are held in off-chip SRAM.]
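The hashed exact-match table can be modeled in a few lines of Python. This is only a sketch under stated assumptions: the XOR-fold hash, the one-entry-per-bucket policy, and the names FlowKey, simple_hash and ExactMatchTable are all hypothetical, since the slide says only "simple hash" and does not describe collision handling.

```python
from typing import NamedTuple, Optional


class FlowKey(NamedTuple):
    """The tag from the slide: [src, dst, sport, dport, proto]."""
    src: int
    dst: int
    sport: int
    dport: int
    proto: int


def simple_hash(key: FlowKey, nbuckets: int) -> int:
    # Hypothetical hash: XOR-fold the header fields, then reduce to a bucket.
    folded = key.src ^ key.dst ^ ((key.sport << 16) | key.dport) ^ key.proto
    return folded % nbuckets


class ExactMatchTable:
    """Valid bits model the on-chip SRAM; entries model off-chip tag+data."""

    SIDES = ("ingress", "egress")

    def __init__(self, nbuckets: int = 1024) -> None:
        self.nbuckets = nbuckets
        # Separate memory areas for ingress and egress packets.
        self.valid = {s: [False] * nbuckets for s in self.SIDES}
        self.entries = {s: [None] * nbuckets for s in self.SIDES}

    def insert(self, side: str, key: FlowKey, data: dict) -> bool:
        slot = simple_hash(key, self.nbuckets)
        if self.valid[side][slot] and self.entries[side][slot][0] != key:
            return False  # assumption: a colliding flow is simply refused
        self.valid[side][slot] = True
        self.entries[side][slot] = (key, data)
        return True

    def lookup(self, side: str, key: FlowKey) -> Optional[dict]:
        slot = simple_hash(key, self.nbuckets)
        if not self.valid[side][slot]:
            return None  # valid bit clear: no off-chip access needed
        tag, data = self.entries[side][slot]
        return data if tag == key else None  # full tag compare
```

Keeping the valid bits apart from the entries mirrors the slide's split between on-chip and off-chip SRAM: a miss on the valid bit is resolved without touching the larger memory.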

General Filter Match
• General filter match considers full 5-tuple
  » prefix match on source and destination addresses
  » range match on source and destination ports
  » exact or wildcard match on protocol
  » each filter has a priority and may be exclusive or nonexclusive
• Intended primarily for management filters.
  » firewall filters
  » class-based monitoring
  » class-based special processing
• Implemented using parallel exhaustive search.
  » limit of 32 filters
[Diagram: filter memory feeding a row of parallel matchers.]
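The matching rules above can be sketched as a software reference model. The hardware checks all filters in parallel; the loop below visits them one by one but returns the same kind of answer, the highest priority exclusive match and the highest priority non-exclusive match. The names GeneralFilter, prefix_match and classify are hypothetical, and the convention that a smaller number means higher priority is an assumption not stated on the slide.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_FILTERS = 32  # limit stated on the slide


@dataclass(frozen=True)
class GeneralFilter:
    src_prefix: Tuple[int, int]   # (32-bit value, prefix length)
    dst_prefix: Tuple[int, int]
    sport_range: Tuple[int, int]  # inclusive (low, high)
    dport_range: Tuple[int, int]
    proto: Optional[int]          # None acts as a wildcard
    priority: int                 # assumption: smaller = higher priority
    exclusive: bool


def prefix_match(addr: int, prefix: Tuple[int, int]) -> bool:
    value, length = prefix
    if length == 0:
        return True  # zero-length prefix matches every address
    shift = 32 - length
    return (addr >> shift) == (value >> shift)


def matches(f: GeneralFilter, src: int, dst: int,
            sport: int, dport: int, proto: int) -> bool:
    return (prefix_match(src, f.src_prefix)
            and prefix_match(dst, f.dst_prefix)
            and f.sport_range[0] <= sport <= f.sport_range[1]
            and f.dport_range[0] <= dport <= f.dport_range[1]
            and (f.proto is None or f.proto == proto))


def classify(filters: List[GeneralFilter], src: int, dst: int,
             sport: int, dport: int, proto: int):
    """Return (best exclusive match, best non-exclusive match)."""
    if len(filters) > MAX_FILTERS:
        raise ValueError("at most 32 general filters are supported")
    best_excl = best_nonexcl = None
    for f in filters:  # exhaustive search over every filter
        if not matches(f, src, dst, sport, dport, proto):
            continue
        if f.exclusive:
            if best_excl is None or f.priority < best_excl.priority:
                best_excl = f
        elif best_nonexcl is None or f.priority < best_nonexcl.priority:
            best_nonexcl = f
    return best_excl, best_nonexcl
```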

Fast IP Lookup (Eatherton & Dittia)
[Figure: example multibit trie for the address 101 100 101 000, with each node encoded as an internal bit vector and an external bit vector.]
• Multibit trie with clever data encoding.
  » small memory requirements (4-6 bytes per prefix typical)
  » small memory bandwidth, simple lookup yields fast lookup rates
  » updates have negligible impact on lookup performance
• Avoid impact of external memory latency on throughput by interleaving several concurrent lookups.
  » 8 lookup engine config. uses about 6% of Virtex 2000E logic cells
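The bit-vector encoding can be illustrated with a small software model. This sketch assumes a stride of 3 bits per node, which is how the example address in the figure is grouped; it represents addresses and prefixes as bit strings and uses Python lists in place of packed memory words. The names Node, rank, insert and lookup are hypothetical, and the code does not reproduce the FIPL hardware layout or its interleaved lookup engines.

```python
STRIDE = 3  # assumption: 3 address bits per node


class Node:
    def __init__(self):
        self.internal = 0   # 7 bits: in-node prefixes of length 0, 1, 2
        self.external = 0   # 8 bits: one per possible child
        self.results = []   # next hops, packed in internal-bit order
        self.children = []  # child nodes, packed in external-bit order


def rank(bitmap, pos):
    """Count the set bits strictly below bit position pos."""
    return bin(bitmap & ((1 << pos) - 1)).count("1")


def internal_pos(bits):
    """Internal-vector position of an in-node prefix (bit string, 0-2 bits)."""
    return (1 << len(bits)) - 1 + (int(bits, 2) if bits else 0)


def insert(root, prefix, next_hop):
    node = root
    while len(prefix) >= STRIDE:            # descend one full stride at a time
        chunk = int(prefix[:STRIDE], 2)
        idx = rank(node.external, chunk)
        if not (node.external >> chunk) & 1:
            node.external |= 1 << chunk
            node.children.insert(idx, Node())
        node = node.children[idx]
        prefix = prefix[STRIDE:]
    pos = internal_pos(prefix)              # remaining 0-2 bits live in this node
    idx = rank(node.internal, pos)
    if (node.internal >> pos) & 1:
        node.results[idx] = next_hop
    else:
        node.internal |= 1 << pos
        node.results.insert(idx, next_hop)


def lookup(root, addr):
    """Longest-prefix match; returns the next hop or None."""
    node, best = root, None
    while True:
        chunk = addr[:STRIDE]
        # Try the longest in-node prefix first.
        for length in range(min(len(chunk), STRIDE - 1), -1, -1):
            pos = internal_pos(chunk[:length])
            if (node.internal >> pos) & 1:
                best = node.results[rank(node.internal, pos)]
                break
        if len(chunk) < STRIDE:
            return best                     # address bits exhausted
        c = int(chunk, 2)
        if not (node.external >> c) & 1:
            return best                     # no child for this chunk
        node = node.children[rank(node.external, c)]
        addr = addr[STRIDE:]
```

Because results and children are located by counting set bits rather than by storing one pointer per slot, a node needs only its two bit vectors plus the packed arrays, which is the source of the small per-prefix memory cost the slide mentions.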

Lookup Throughput & Latency
[Chart: millions of lookups per second (left axis) and average lookup latency in ns (right axis) versus number of FIPL engines (1 to 8). Series: Mae West throughput, worst-case throughput, Mae West avg. lookup latency, worst-case avg. lookup latency. Annotations: linear throughput gain, negligible latency increase.]

Update Performance
[Chart: millions of lookups per second versus number of FIPL engines (1 to 8). Series: no updates, 10K updates/sec, 100K updates/sec (1 update every 10 µs). Annotation: reasonable update rates have little impact.]

Queue Manager Logical View (QM)
[Diagram: arriving packets are directed either to the link or to the switch. Link side labels: pkt. sched., res. flow queues, datagram queues. Switch side labels: VOQ pkt. sched., res. flow queues and datagram queue per output (output 0 to output 8), DQ.]
• separate queues for each reserved flow
• 64 hashed datagram queues for traffic isolation
• separate queue for each flow
• separate queue set for each output

Distributed Queueing
[Diagram: Control Processor and Switch Fabric, with input (I) and output (O) sides at each port; each port has Sched., Routing, and TI blocks.]
• periodic queue length reports
• queue per output
• Scheduler paces each queue according to backlog share

Basic Distributed Queueing Algorithm
• Goal: avoid switch congestion and output queue underflow.
• Let B(i,j) be backlog at input i for output j, B(j) be backlog at output j.
• Can avoid output-side switch congestion if rate(i,j) ≤ hi(i,j) = L_j S B(i,j)/B(+,j)
  » where L_j is external link rate at output j and S is switch speedup
• Can avoid underflow at output j if rate(i,j) ≥ lo(i,j) = L_j B(i,j)/(B(j) + B(+,j))
  » this can be achieved if lo(i,+) ≤ L_i S for all i
• Can avoid input-side switch congestion if rate(i,j) ≤ hi'(i,j) = L_i S lo(i,j)/lo(i,+)
• Let rate(i,j) = min{ hi(i,j), hi'(i,j) }.
• Algorithm avoids congestion and, for large enough S, avoids underflow.
  » what is the smallest value of S for which underflow cannot occur?
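The rate computation can be written out directly from the formulas on this slide. The sketch below assumes that "+" in B(+,j) and lo(i,+) denotes a sum over that index, that every port has both an input and an output side so one list L of link rates serves both, and that a term with a zero denominator contributes a rate of zero. The function name dq_rates is hypothetical.

```python
from typing import List


def dq_rates(B_in: List[List[float]], B_out: List[float],
             L: List[float], S: float) -> List[List[float]]:
    """Compute rate(i,j) = min(hi(i,j), hi'(i,j)).

    B_in[i][j] is the backlog at input i for output j, B_out[j] the backlog
    at output j, L[k] the external link rate of port k, S the switch speedup.
    """
    n = len(B_in)
    # B(+,j): total input-side backlog destined for output j
    col_sum = [sum(B_in[i][j] for i in range(n)) for j in range(n)]

    hi = [[0.0] * n for _ in range(n)]
    lo = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if col_sum[j] > 0:
                # output-side congestion bound
                hi[i][j] = L[j] * S * B_in[i][j] / col_sum[j]
            if B_out[j] + col_sum[j] > 0:
                # underflow-avoidance lower bound
                lo[i][j] = L[j] * B_in[i][j] / (B_out[j] + col_sum[j])

    rate = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lo_row = sum(lo[i])  # lo(i,+)
        for j in range(n):
            # input-side congestion bound hi'(i,j)
            hi2 = L[i] * S * lo[i][j] / lo_row if lo_row > 0 else 0.0
            rate[i][j] = min(hi[i][j], hi2)
    return rate


if __name__ == "__main__":
    # Two inputs, each with 10 units queued for output 0; speedup 2.
    for row in dq_rates([[10, 0], [10, 0]], [0, 0], [1.0, 1.0], 2.0):
        print(row)
```

In the two-input example each input is given a rate of 1.0 toward output 0, so the total into that output equals L_0 S = 2.0 and the output side of the switch is not overloaded.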

Stress Test
• can vary number of inputs and outputs used, and length of phases

Stress Test Simulation - Min Rates
[Chart: cumulative min rate sums lo(1,1), +lo(1,2), +lo(1,3), +lo(1,4), +lo(1,5) versus time (0 to 8,000), speedup = 1.5, with the critical rate and the first and second phases marked.]

Stress Test - Actual Rates
[Chart: cumulative allocated rate sums rate(1,1), +rate(1,2), +rate(1,3), +rate(1,4), +rate(1,5) versus time (0 to 8,000), speedup = 1.5, with the critical rate and the first and second phases marked. Annotation: under-use of input bandwidth.]

Stress Test - Input Queue Lengths
[Chart: input queue lengths B(1,1) through B(1,5) versus time (0 to 8,000), speedup = 1.5. Annotation: input side backlog for final output implies underflow.]

Stress Test - Output Queue Lengths
[Chart: output queue lengths B(1) through B(5) versus time (0 to 8,000), speedup = 1.5. Annotation: persistent output side backlog caused by earlier dip in forwarding rate.]

Resource Usage Estimates
• Key resources in Xilinx FPGAs
  » flip flops - 38,400
  » lookup tables (LUTs) - 38,400
    - each can implement any 4 input Boolean function
  » block RAMs (4 Kbits each) - 160

                         Number                % of total
                  flops     LUTs   RAMs    flops    LUTs    RAMs
CARL              2,217    4,695     32     5.8%   12.2%   20.0%
CCP               1,156    1,612      3     3.0%    4.2%    1.9%
FIFOs               133      284     10     0.3%    0.7%    6.3%
ISAR              4,000    5,400     28    10.4%   14.1%   17.5%
OSAR              2,000    3,000     24     5.2%    7.8%   15.0%
PSM (both)        4,722    4,148     20    12.3%   10.8%   12.5%
QM version 1,2   13,258   12,085     27    34.5%   31.5%   16.9%
Total            27,486   31,224    144    71.6%   81.3%   90.0%
Resource Count   38,400   38,400    160
% Usage             72%      81%    90%
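The totals and percentages in the table can be re-derived from the per-block counts. The short script below is only an arithmetic check using the figures from the slide; the names CAPACITY, USAGE, totals and percent_used are hypothetical.

```python
# Device capacity as listed on the slide: (flip flops, LUTs, block RAMs).
CAPACITY = (38_400, 38_400, 160)

# Per-block estimates from the table: (flops, LUTs, RAMs).
USAGE = {
    "CARL": (2_217, 4_695, 32),
    "CCP": (1_156, 1_612, 3),
    "FIFOs": (133, 284, 10),
    "ISAR": (4_000, 5_400, 28),
    "OSAR": (2_000, 3_000, 24),
    "PSM (both)": (4_722, 4_148, 20),
    "QM version 1,2": (13_258, 12_085, 27),
}


def totals(usage):
    """Column-wise sums over all blocks."""
    return tuple(sum(column) for column in zip(*usage.values()))


def percent_used(counts, capacity=CAPACITY):
    """Share of each resource consumed, in percent, to one decimal place."""
    return tuple(round(100 * c / cap, 1) for c, cap in zip(counts, capacity))


if __name__ == "__main__":
    total = totals(USAGE)
    print("Total:", total)
    print("% of device:", percent_used(total))
```

Block RAM is the tightest resource at 90%, ahead of LUTs at about 81% and flip flops at about 72%.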

Comparison of available FPGAs on FPXs
[Chart: signal delay (ns) versus LUTs in datapath (FFs in corners) for the XCV2000e-6 and XCV1000e-7, with reference levels for flops in opposite corners and flops in adjacent cells/CLBs.]

Summary
• Single XCV2000 FPGA can do IP packet processing for gigabit link.
  » would be simple if just did route lookup and FIFO queues
  » using SDRAMs effectively is hard
    - significant overheads - dependent on sequences of operations
  » packet classification & general queueing adds complexity
    - intelligent packet discarding greatly expands required memory bandwidth
  » achieving wire-speed operation under worst-case conditions is challenging
• Expect to complete first version this fall.
• Complete version by middle of 2003.