// Bottlenecks: Memory, Memory, Memory — Switch and Router Design
Dr. David Hay, Ross 8b, dhay@cs.huji.ac.il
Source: Nick McKeown, Isaac Keslassy

// Packet Processing Examples: Address Lookup (IP/Ethernet)
Where to send an incoming packet?
- Exact match: use the output port to send packets to a given destination MAC address (Ethernet).
- Longest Prefix Match (LPM): use the output port to send packets to a destination network (IP).

// Packet Processing Examples: Intrusion Detection, Firewalls
- Deep Packet Inspection (DPI): drop all packets that contain a given string (e.g., "EvilWorm") anywhere within the packet; see the SNORT rule set.
- Firewall / ACL: which packets to accept or deny? E.g., drop all packets from an evil source network on given ports. Usually needs several header fields: source address, destination address, source port, destination port, protocol.

// Packet Processing Rate
[Table of line rates (Mb/s through tens of Gb/s) by year vs. required packet rate in Mpkt/s; numeric values lost in extraction. The point: as line rates grow, the per-packet processing budget shrinks to tens of nanoseconds.]

// Memory Technology
[Table comparing networking DRAM, SRAM, and TCAM on single-chip density (MB), $/chip ($/MByte), access speed (ns), and watts/chip; numeric values lost in extraction. Qualitatively: DRAM is the densest and cheapest per MByte but slowest to access; SRAM is faster but less dense and more expensive; TCAM is the least dense, most expensive, and most power-hungry.]
- The lookup mechanism must be simple and easy to implement.
- Memory access time is the long-term bottleneck. (Surprise?)
- Note: price, speed, and power are manufacturer- and market-dependent. The numbers are a bit outdated but give the general idea.
// Simplest Task: Exact Matching in Bridges
- Bridges work at layer 2 (Ethernet) and connect Ethernet networks.
- Wire-speed forwarding: each time a packet arrives at a bridge, forward it according to its destination MAC address, and also store/update the source MAC address (learning). Both should be done at wire speed.

// Solution 1: Binary Search
- MAC addresses have values that can be sorted. Keeping them in a sorted array, one can perform a binary search to find the right MAC address.
- However, each iteration is a memory access, so a lookup costs about log2(N) memory accesses.
- This works fine (even using DRAM) for small N and low line speeds, but it does not scale to large N and higher speeds.
- Using faster hardware (SRAM) won't really solve the problem, and it is more expensive.

// Scaling Using Hashing — Example: Gigaswitch
- Hashing is much faster than binary search on average, but much slower in the worst case (up to linear time).
- However, one can choose (pre-compute) good hash functions so that the number of collisions is small and bounded. Precomputation takes a lot of time, but addresses are not added at a rapid rate; applying the hash function is done at wire speed.
- Scheme: for each 48-bit address addr, first apply h(addr); the low-order bits of the result index the hash table, and each entry is a balanced binary tree of bounded height, sorted by the remaining high-order bits.
- The hash function should guarantee that no more than 8 addresses fall into the same tree, and that the high-order bits disambiguate between addresses.
- Solve corner cases separately (CAM); rehash if the guarantee breaks. The whole lookup then takes a small, bounded number of memory accesses, versus log2(N) for plain binary search.
- More sophisticated data structures/hashing techniques can also be applied (e.g., to reduce memory): Bloom filters, fingerprinting, etc.

// IP Longest Prefix Matching
[Forwarding-table example: each entry maps a prefix to a next hop and an output port; specific prefixes and addresses lost in extraction. The destination address in the packet header may match several prefixes; a longer (more specific) matching prefix is a better match.]
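The binary-search solution above can be sketched in a few lines. This is a minimal illustration, not the bridge's actual implementation: MAC addresses are modeled as integers (the example addresses are made up, from the IANA documentation range), and each `bisect` probe stands in for one memory access.

```python
import bisect

def lookup_port(sorted_macs, ports, mac):
    """Exact-match lookup: binary search over a sorted array of MACs.

    Costs about log2(N) probes, i.e., log2(N) memory accesses in hardware.
    """
    i = bisect.bisect_left(sorted_macs, mac)
    if i < len(sorted_macs) and sorted_macs[i] == mac:
        return ports[i]
    return None  # unknown destination: a learning bridge would flood

# Hypothetical learned table: (MAC as 48-bit integer, output port).
table = sorted([(0x0000_5E00_5301, 1), (0x0000_5E00_5302, 2)])
macs = [m for m, _ in table]
prts = [p for _, p in table]
port = lookup_port(macs, prts, 0x0000_5E00_5302)  # finds port 2
```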
// Longest Prefix Match is Harder than Exact Match
- The destination address of an arriving packet does not carry with it the information needed to determine the length of the longest matching prefix.
- Hence, one needs to search among the space of all prefix lengths, as well as the space of all prefixes of a given length.
[IP forwarding-table figure with example prefixes and output ports; specific values lost in extraction.]
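The search over the space of prefix lengths can be made concrete by keeping one exact-match table per length and trying lengths longest-first. This is a hedged sketch, not the lecture's algorithm; the names (`build_tables`, `lpm`) and the example routes are made up.

```python
def build_tables(prefixes):
    """prefixes: iterable of (prefix_value, length, next_hop), where
    prefix_value holds only the top `length` bits of a 32-bit address."""
    tables = {}
    for value, length, next_hop in prefixes:
        tables.setdefault(length, {})[value] = next_hop
    return tables

def lpm(tables, addr):
    """Try each prefix length, longest first; the first hit is the LPM."""
    for length in sorted(tables, reverse=True):
        hit = tables[length].get(addr >> (32 - length))
        if hit is not None:
            return hit
    return None

# Illustrative routes: 10.0.0.0/8 -> "R1", 10.1.0.0/16 -> "R2".
tables = build_tables([(10, 8, "R1"), ((10 << 8) | 1, 16, "R2")])
addr = (10 << 24) | (1 << 16) | (2 << 8) | 3   # 10.1.2.3 -> matches "R2"
```

With a hash table per length this is one exact match per distinct prefix length; in hardware the per-length matches can run in parallel and feed a priority encoder.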
// Current Practical Data
- Caching works poorly in backbone routers: too many concurrent flows.
- Wire-speed lookup is needed even for minimum-size 40-byte packets (a large fraction of traffic is TCP ACKs), which leaves only tens of nanoseconds per packet at multi-Gb/s line rates.
- Lookup time is dominated by memory accesses, so speed is measured in memory accesses.
- IPv4 prefix lengths range from 8 to 32 bits; tables today hold hundreds of thousands of prefixes and are growing toward a million.
- Higher speeds need SRAM, so it is worth minimizing memory.

// Problem Definition
LPM: find the most specific route, i.e., the longest matching prefix among all the prefixes matching the destination address of an incoming packet.
[Routing-table figure with example prefixes, routes, and destination addresses; specific values lost in extraction.]

// LPM in IPv4: Use Exact Match Algorithms for LPM!
- Run an exact match against the prefixes of each length separately, then priority-encode and pick the longest matching length, which determines the output port.

// Metrics for Lookup Algorithms
- Speed (= number of memory accesses)
- Storage requirements (= amount of memory)
- Low update time
- Scalability with the length of the prefix: IPv4 unicast (32b), Ethernet (48b), IPv4 multicast (64b), IPv6 unicast (128b)
- Scalability with the size of the routing table (sweet spot for today's designs: around a million prefixes)
- Flexibility in implementation
- Low preprocessing time

// Our Toy Example: Unibit (= Radix) Tries
[Trie figure: a set of prefixes P1, P2, ... of the form b...b*, stored in a binary trie; the specific bit strings, the example packet, and the resulting forwarding decision are lost in extraction. Each trie node holds an optional stored prefix and a pointer per bit value; a lookup walks the trie bit by bit and remembers the last stored prefix seen, which is the longest match.]
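The unibit-trie lookup described above (walk one bit at a time, remember the last stored prefix) can be sketched as follows. This is an illustrative model, not the slides' exact figure; the node layout and the example prefixes are made up.

```python
class Node:
    """One unibit-trie node: a 0-child, a 1-child, and an optional prefix."""
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]
        self.next_hop = None

def insert(root, prefix_bits, next_hop):
    """Walk/create one node per prefix bit; store the next hop at the end."""
    node = root
    for b in prefix_bits:
        if node.children[b] is None:
            node.children[b] = Node()
        node = node.children[b]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    """Follow the address bits, remembering the last stored prefix seen."""
    node, best = root, None
    for b in addr_bits:
        if node.next_hop is not None:
            best = node.next_hop
        node = node.children[b]
        if node is None:
            break
    else:
        if node is not None and node.next_hop is not None:
            best = node.next_hop
    return best

root = Node()
insert(root, [1, 0, 1], "P1")   # prefix 101* (illustrative)
insert(root, [1], "P2")         # prefix 1*
```

Each step down the trie is one memory access, so lookup cost is bounded by the address width W — which is exactly why the next slides attack the trie's depth.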
// Unibit Tries: Compacting One-Way Branches (a variant of the PATRICIA tree)
- Chains of internal nodes that have a single child and store no prefix can be collapsed into one node that records the skipped bits, saving both memory and memory accesses.
[Series of trie figures tracing lookups before and after compaction, showing the input bit string and the number of memory accesses at each step; specific values lost in extraction.]
// Unibit Tries — Analysis
- For W-bit prefixes and N prefixes: O(W) lookup, O(NW) storage, and O(W) update complexity.
- Patricia: O(N) storage (why?).
- Still slow and memory-hungry, but: simple, and extensible to wider fields.

// Multi-bit Tries
- Binary trie: depth = W, degree = 2, stride = 1 bit.
- Multi-ary trie: depth = W/k, degree = 2^k, stride = k bits.
- Principle: trade memory for speed.

// Prefix Expansion with Multi-bit Tries
- If the stride is k bits, prefixes whose lengths are not a multiple of k need to be expanded. E.g., for k = 2, the prefix 1* becomes the expanded prefixes 10* and 11*.
- Maximum number of expanded prefixes corresponding to one non-expanded prefix: 2^(k-1).
- Prefix expansion increases storage consumption: next-hop pointers are replicated, and nodes contain a greater number of unused (null) pointers.
- Time ~ W/k; storage ~ (N·W/k)·2^(k-1).
- Improvement: from fixed-stride tries to variable-stride tries.
[Quad-trie (k = 2) figure with expanded prefixes; specific values lost in extraction.]

// Ternary Content-Addressable Memory (TCAM)
- Each entry is a word in {0, 1, *}^W and represents a rule.
- A search key is compared against all entries of the TCAM array in parallel; the match lines feed an encoder that outputs the index of the highest-priority matching entry.
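Returning to prefix expansion: the rule above (pad a prefix out to the next multiple of k bits, enumerating all completions) can be sketched as follows. The function name is illustrative; prefixes are modeled as lists of bits.

```python
def expand(prefix_bits, k):
    """Expand a prefix to the next multiple of the stride k.

    A prefix needing r extra bits becomes 2**r expanded prefixes, so one
    prefix yields at most 2**(k-1) expansions (when r = k - 1).
    """
    r = (-len(prefix_bits)) % k          # bits missing to reach a multiple of k
    out = []
    for i in range(2 ** r):              # enumerate every possible completion
        tail = [(i >> (r - 1 - j)) & 1 for j in range(r)]
        out.append(prefix_bits + tail)
    return out

# k = 2: the prefix 1* expands to 10* and 11*, as in the slide's example.
expanded = expand([1], 2)                # [[1, 0], [1, 1]]
```

If two expansions collide with an existing longer prefix, the longer (original) prefix must win — real controlled-prefix-expansion implementations resolve such overlaps during precomputation.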
// Example TCAM
[Figure: TCAM array with match lines feeding a priority encoder; a search key matches several entries and the encoder outputs the highest-priority one. Specific entry values lost in extraction.]

// TCAM: Benefits and Disadvantages
- Deterministic search throughput: O(1) search.
- Very flexible, applicable to other problems as well. Next week: multi-field packet classification.
- However, relatively costly and energy-consuming (the price figure on the original slide was lost in extraction); energy depends on the number of entries.
- Millions of TCAM devices are already deployed.

// Typical Dimensions and Speed
[Specific figures (number of rules, symbols per rule, searches per second, key width) lost in extraction; the slide's conclusion is that TCAM search rates are suitable even for multi-Gb/s traffic.]
- IPv4 and IPv6 lookups are trivial with TCAM.
- Extra symbols are left in each entry, which can be used to optimize TCAM performance.
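A TCAM's behavior — ternary entries over {0, 1, *}, all compared in parallel, lowest index winning via the priority encoder — can be modeled in software. This is only a sequential simulation of the parallel hardware; entries are (value, mask) pairs, where a set mask bit means "care" and a clear bit means "*". The 8-bit entries below are made up.

```python
def tcam_search(entries, key):
    """Return the index of the first (highest-priority) matching entry.

    Hardware compares all entries at once; this loop models the priority
    encoder by scanning in index order.
    """
    for i, (value, mask) in enumerate(entries):
        if (key & mask) == (value & mask):
            return i
    return None

# For LPM, entries are ordered longest prefix first, so the priority
# encoder's lowest-index match is automatically the longest match.
entries = [
    (0b1010_0000, 0b1111_0000),  # 1010**** (prefix length 4)
    (0b1000_0000, 0b1100_0000),  # 10****** (prefix length 2)
]
hit = tcam_search(entries, 0b1010_1010)  # matches entry 0, the /4 prefix
```

The ordering requirement is the hidden cost: inserting a new prefix may require shifting entries to keep longer prefixes at lower indices, which is one reason TCAM updates are more delicate than searches.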