Implementing Gigabit Routers with NetFPGA
Prof. Sasu Tarkoma

Overview
- The NetFPGA is a low-cost platform for teaching networking hardware and router design, and a tool for networking researchers.
- The NetFPGA offloads processing from a host processor.
- The host's CPU has access to main memory and can use DMA to read and write registers and memories on the NetFPGA.
- A hardware-accelerated datapath.
- Four Gigabit ports and multiple banks of local memory installed on the card.
- Uses Verilog and a cross-compilation environment.

Basic Architectural Components of an IP Router
- Software: control plane, protocols
- Hardware: datapath, per-packet processing

Per-packet processing in an IP Router
1. Accept packet arriving on an incoming link.
2. Look up packet destination address in the forwarding table, to identify outgoing port(s).
3. Manipulate packet header: e.g., decrement TTL, update header checksum.
4. Send packet to the outgoing port(s).
5. Buffer packet in the queue.
6. Transmit packet onto outgoing link.

Generic Router Architecture
[Figure: IP header processing with a lookup over ~1M prefixes (off-chip DRAM) that yields the IP next hop; a packet queue holding ~1M packets (off-chip DRAM).]
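Step 3 of the per-packet processing list (decrement TTL, update header checksum) can be sketched in software as below. This is a minimal Python model, not the NetFPGA datapath: the function names `ipv4_checksum` and `forward_header` are illustrative, the header is assumed to be a plain 20-byte IPv4 header without options, and the checksum is recomputed in full rather than updated incrementally as hardware would typically do.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """Return the one's-complement checksum over 16-bit big-endian words.

    Called on a header whose checksum field is zero, it yields the value
    to store. Called on a complete valid header, it yields 0.
    """
    if len(header) % 2:
        header += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def forward_header(header: bytes) -> bytes:
    """Decrement the TTL (byte 8) and rewrite the checksum (bytes 10-11)."""
    hdr = bytearray(header)
    if hdr[8] <= 1:
        # A real router would hand this packet to exception processing.
        raise ValueError("TTL expired")
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"  # checksum field must be zero while summing
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)
```

Usage: passing a valid 20-byte header to `forward_header` returns a header whose TTL is one lower and for which `ipv4_checksum` evaluates to 0 again.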
Generic Router Architecture
[Figure: diagram with three IP header blocks.]

Rule of thumb
- Buffer size is important.
  - Small queues reduce delay.
  - Large buffers are expensive.
- A router needs a buffer size of $B = 2T \times C$.
  - $2T$ is the two-way propagation delay (typically 250 ms).
  - $C$ is the capacity of the bottleneck link.
  - For example, $2T = 250$ ms and $C = 10$ Gbit/s give $B = 2.5$ Gbit.
- Appears in IETF architectural guidelines.
- TCP flows are a key input for buffer sizing.
  - The number of flows is large enough that flows are independent and unsynchronized.

Algorithms
- Linear search: slow.
- Direct lookup: requires a lot of memory; a prefix update may lead to many changes.
- Tries: deterministic lookup time, but require multiple memory references.
- TCAM: efficient parallel evaluation, but high energy consumption.

Algorithms
- CAM: Content Addressable Memories
  - Associative memory
  - Compares all entries in parallel
  - Binary CAM: exact matching
  - Ternary CAM: partial matching
- T-CAM: Ternary Content-Addressable Memories
  - Partial matching in a single cycle
  - Reports the index of the first match
  - TCAM (prefix) indexes SRAM (next hop address)
- Algorithmic methods
  - Bloom filter, ...

T-CAM
- Fast, cost-effective, simple to manage
- High power consumption
- Hardware compares the query word to all stored words (prefixes) in parallel
- Each bit of a word can be 0, 1, or X (don't care)
- If multiple entries match, the one at the lowest address is returned, so prefixes are typically stored longest first to obtain a longest-prefix match

CAM and T-CAM Applications
- CAM
  - Translation lookaside buffer (TLB): a CPU cache used by memory management hardware to speed up virtual address translation
  - Cache memories
  - Data compression
  - Image processing
  - Packet forwarding
- T-CAM
  - Packet forwarding
  - Packet classification
  - L4 switching
  - Intrusion detection
  - Pattern matching
  - Database operations
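The TCAM (prefix) plus SRAM (next hop) arrangement described above can be modelled in software to make the matching rule concrete. The sketch below is a Python illustration under stated assumptions: the class name `TcamModel` and its methods are hypothetical, entries are stored as (value, mask) pairs where a 0 mask bit plays the role of X (don't care), and the loop scans entries one by one, whereas a real TCAM compares all entries in parallel in a single cycle.

```python
import ipaddress


class TcamModel:
    """Software model of a TCAM (prefixes) paired with an SRAM (next hops)."""

    def __init__(self, routes):
        # Store longer prefixes at lower addresses so that the first
        # (lowest-address) match is also the longest-prefix match.
        ordered = sorted(
            routes.items(),
            key=lambda item: ipaddress.ip_network(item[0]).prefixlen,
            reverse=True,
        )
        self.tcam = []  # list of (value, mask); mask bit 0 means don't care
        self.sram = []  # next hop stored at the same index as its prefix
        for cidr, next_hop in ordered:
            net = ipaddress.ip_network(cidr)
            self.tcam.append((int(net.network_address), int(net.netmask)))
            self.sram.append(next_hop)

    def match_index(self, address):
        """Return the index of the first matching entry, or None."""
        key = int(ipaddress.ip_address(address))
        for index, (value, mask) in enumerate(self.tcam):
            if key & mask == value:
                return index
        return None

    def lookup(self, address):
        """Return the next hop for address, or None if nothing matches."""
        index = self.match_index(address)
        return None if index is None else self.sram[index]
```

Usage: `TcamModel({"10.0.0.0/8": "A", "10.1.0.0/16": "B"}).lookup("10.1.2.3")` selects the /16 entry because it sits at a lower index than the /8 entry.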
[Figure: router diagram showing processing and exception processing.]

Bloom Filters
- A Bloom filter is a probabilistic set membership test (lookup function).
  - Does item x exist in a set or a multiset?
- Introduced by Burton H. Bloom in 1970.
- Various applications.
- There are no false negatives, but false positives are allowed.
- Encoding an attribute $a \in U$, $n = |U|$.
- Maintain a bit vector $V$ of size $m$.
- Use $k$ hash functions $h_1, \dots, h_k$ with $h_i : U \to [1..m]$.
- Insert: for item $x$, set bits $V[h_1(x)], \dots, V[h_k(x)]$.
- Lookup: for item $x$, test bits $V[h_1(x)], \dots, V[h_k(x)]$. If all are 1, return "probably yes"; otherwise return "no".

Bloom Filter
[Figure: bit vector $V_0 \dots V_{m-1}$ with the bits at positions $h_1(x), h_2(x), h_3(x), \dots, h_k(x)$ set to 1.]

Bloom Filter Tradeoffs
- Three factors: $m$, $k$ and $n$.
- Typically $n$ and $m$ are given, and $k$ is selected.
- $k$ is optimal when the hit ratio (the fraction of bits set in the array) is 0.5.
- The false positive probability is then $(1/2)^k = 0.6185^{m/n}$.

[Figure: project timeline, numbered 1 to 6. Software track: build basic router; command line interface; protocol (PWOSPF); integrate with H/W; interoperability; "Wow us!" (innovate and add, presentations, judges). Hardware track: learning environment (modular design, testing); 4-port non-learning switch; 4-port learning switch; IPv4 router forwarding path; integrate with S/W; interoperability; "Wow us!".]
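The Bloom filter insert and lookup operations from the slides above, together with the tradeoff between m, k and n, can be sketched as follows. This is a minimal Python illustration under assumptions not in the slides: the k hash functions are derived by salting SHA-256 with the function index, positions run from 0 to m-1 rather than 1 to m, one byte is used per bit for readability, and the names `BloomFilter`, `optimal_k` and `false_positive_rate` are made up for this example.

```python
import hashlib
import math


class BloomFilter:
    """Bit vector V of size m with k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for clarity only

    def _positions(self, item: str):
        # Assumption: h_i(x) is SHA-256 over "i:x", reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def lookup(self, item: str) -> bool:
        """True means "probably yes"; False means "definitely no"."""
        return all(self.bits[pos] for pos in self._positions(item))


def optimal_k(m: int, n: int) -> int:
    """Number of hash functions that fills about half of the bits."""
    return max(1, round(m / n * math.log(2)))


def false_positive_rate(m: int, n: int) -> float:
    """Approximate false positive probability at the optimal k."""
    return 0.6185 ** (m / n)
```

Usage: with m = 1024 bits and n = 100 items, `optimal_k(1024, 100)` gives 7 hash functions; every inserted item is then reported as present, while an item that was never inserted may occasionally be reported as present too.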
Verilog
- Verilog is a hardware description language (HDL) used to model electronic systems.
- The language supports the design, verification, and implementation of analog, digital, and mixed-signal circuits at various levels of abstraction.
- The concept of time is important.
- Statements are executed concurrently.
- The language is case-sensitive, has a preprocessor like C, and the major control flow keywords, such as "if" and "while", are similar.
- Verilog uses begin/end instead of curly braces to define a block of code.

Verilog II
- The definition of a constant in Verilog requires a bit width along with its base.
- A Verilog design consists of a hierarchy of modules.
- Modules are defined with a set of input, output, and bidirectional ports.
- Internally, a module contains a list of wires and registers.
- Concurrent and sequential statements define the behaviour of the module by defining the relationships between the ports, wires, and registers.
- Sequential statements are placed inside a begin/end block and executed in sequential order within the block.
- All concurrent statements and all begin/end blocks in the design are executed in parallel, which qualifies Verilog as a dataflow language.

Keywords
- The always keyword indicates a free-running process that triggers on the accompanying event-control (@) clause (similar to while(1) {..} in C).

    always @(posedge a)   // Run whenever reg a has a low-to-high change
      a <= b;
    always @(a or b)      // Whenever a or b changes

- The initial keyword indicates a process that executes exactly once.
- The fork/join pair is used by Verilog to create parallel processes.
- Also the forever keyword.
- Delays with #.
- Non-blocking operators, for example <=.

Synthesizable
- A subset of statements in the language is synthesizable.
- If the modules in a design contain only synthesizable statements, software can be used to transform (synthesize) the design into a netlist that describes the basic components and connections to be implemented in hardware (ASIC, FPGA).

Mux
Three equivalent descriptions of a 2-to-1 multiplexer:

    // 1: continuous assignment
    wire out;
    assign out = sel ? a : b;

    // 2: case statement
    reg out;
    always @(a or b or sel)
      case (sel)
        1'b0: out = b;
        1'b1: out = a;
      endcase

    // 3: if/else
    reg out;
    always @(a or b or sel)
      if (sel)
        out = a;
      else
        out = b;
FlipFlop

    module toplevel(clock, reset);
      input clock;
      input reset;

      reg flop1;
      reg flop2;

      // Asynchronous reset; otherwise the two flops swap values on each
      // rising clock edge (non-blocking assignments read the old values).
      always @(posedge reset or posedge clock)
        if (reset) begin
          flop1 <= 0;
          flop2 <= 1;
        end else begin
          flop1 <= flop2;
          flop2 <= flop1;
        end
    endmodule

FlipFlops
- Building block for logic
  - One-bit storage
  - Counters
  - Finite state machines
- With a Schmitt trigger, can be used to implement an arbiter in asynchronous circuits
  - Selects the order of access to a shared resource
  - Note metastability issues

Procedural Interface
- The Verilog Procedural Interface (VPI) is an interface primarily intended for the C programming language.
- It allows behavioral Verilog code to invoke C functions, and C functions to invoke standard Verilog system tasks.

Applications
- IDS/IDP, pattern matching, firewalls
- Content processing and string matching
- IP lookup and packet classification
- Buffering and queueing
- Protocol processing
- TCP/IP flow processing
- Semantic processing
- Classification and clustering
- Reconfigurable hardware platforms
- Soft-core CPUs on FPGAs