Novel IP Address Lookup Algorithm for Inexpensive Hardware Implementation


KARI SEPPÄNEN
Information Technology Research Institute
Technical Research Centre of Finland (VTT)
PO Box 1202, FIN-02044 VTT, Finland

Abstract: The key factor defining the efficiency of IP routers is the speed of the forwarding operation, that is, the speed of determining the next-hop destination for each packet. The operation is not simple because IP addresses are unstructured and the destination subnetworks can overlap. This requires a so-called longest match lookup operation. In this paper I propose a simple and very fast address lookup algorithm that can be easily implemented in hardware. It is designed for inexpensive systems and thus requires only standard SRAM and FPGA devices. However, its performance exceeds even the requirements of today's backbone routers, and it allows for incremental forwarding table updates.

Key-Words: IP routing, address lookup, Gigabit routers

1 Introduction

The operation of IP networks is based on connectionless datagram routing performed hop-by-hop from the source host to the destination host. While it is possible to define the route explicitly at the source end and include that information in the datagram, the existing networks operate solely in hop-by-hop routing mode to avoid excess overhead. The hop-by-hop routing is based on determining the next-hop destination according to the destination address included in each datagram. The next hop is defined by a forwarding table that is maintained in each network node doing routing operations, that is, an IP router. A forwarding table contains a set of IP subnetwork definitions and, for each subnetwork, the address of the desired next-hop destination.

One of the key factors determining the efficiency of an IP router is the speed of the forwarding operation, that is, the time it takes to resolve the next-hop address based on the destination address of a packet. What makes it complicated is the fact that IP version 4 (IPv4) addresses are unstructured (the so-called classless interdomain routing, CIDR, scheme) and thus simple lookup algorithms are not suitable. Moreover, the route specifications can be overlapping, that is, there can be smaller address ranges specified inside a larger address range. These address ranges are called prefixes, which are composed of a network address and its length. So there can be overlapping prefix definitions such as 138.0.0.0/8 (meaning addresses whose first 8 bits equal 138) and a longer /10 prefix inside that range. The IP routing policy is defined so that the longest prefix definition matching the destination address always defines the next hop. This is the so-called longest match principle [1].

An additional constraint in designing an efficient forwarding algorithm is the dynamic nature of routing information. In principle all routing information changes, such as new routes and new subnetworks, are visible to all backbone routers. These routers have to process the changes on the fly and adjust the forwarding table accordingly. However, if updates require extensive computation, a vast number of memory accesses, or even total reconstruction of the search structure, they can degrade the performance of a router considerably. This could be quite serious in certain abnormal, but not rare, situations such as a failure of an important backbone router or in the case of route flapping [2, p. 233]. All this requires a lookup structure that can be updated with a reasonable workload and without stopping datagram forwarding.
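As a concrete illustration of the longest match principle (an illustration only, not code from the paper), the following small C program keeps the longest matching prefix found in a linear scan of a two-entry table; the 138.0.0.0/8 prefix, the /10 inside it, and the next-hop numbers are made-up example values.

    #include <stdint.h>
    #include <stdio.h>

    /* A prefix is a network address plus its length in bits. */
    struct prefix { uint32_t net; int len; int next_hop; };

    /* Do the first 'len' bits of 'addr' equal 'net'? */
    static int matches(uint32_t addr, uint32_t net, int len)
    {
        uint32_t mask = len ? 0xFFFFFFFFu << (32 - len) : 0;
        return (addr & mask) == (net & mask);
    }

    /* Longest prefix match by linear scan: keep the longest matching prefix. */
    static int lookup(const struct prefix *tbl, int n, uint32_t addr)
    {
        int best_len = -1, best_hop = -1;
        for (int i = 0; i < n; i++)
            if (matches(addr, tbl[i].net, tbl[i].len) && tbl[i].len > best_len) {
                best_len = tbl[i].len;
                best_hop = tbl[i].next_hop;
            }
        return best_hop;   /* -1 means no match */
    }

    int main(void)
    {
        /* Hypothetical overlapping prefixes: 138.0.0.0/8 and 138.192.0.0/10. */
        struct prefix tbl[] = {
            { 0x8A000000u,  8, 1 },   /* 138.0.0.0/8    -> next hop 1 */
            { 0x8AC00000u, 10, 2 },   /* 138.192.0.0/10 -> next hop 2 */
        };
        uint32_t addr = 0x8AC10203u;  /* 138.193.2.3: both match, /10 wins */
        printf("next hop %d\n", lookup(tbl, 2, addr));
        return 0;
    }

Real routers cannot afford the linear scan, which is exactly why the lookup structures discussed below exist; the sketch only pins down what "longest match" means.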
Until now there have been only two real alternatives for implementing the IP address lookup mechanism: either to use a general-purpose CPU with a software-based algorithm or to use an application-specific integrated circuit (ASIC) with a hard-wired algorithm.

Both approaches have their weaknesses, such as the poor performance of software algorithms or the long design period and large NRE costs of ASICs. However, the ongoing development of field programmable gate arrays (FPGA) and various fast and flexible memory devices, such as zero bus turnaround static random access memory (ZBT-SRAM), has created a third alternative approach. This combination offers performance on par with an ASIC as well as the rapid and versatile development process of software. Furthermore, a cleanly designed submodule implementing the lookup algorithm can be easily integrated into any FPGA or ASIC design requiring its functionality. In this paper I first give a short overview of the existing algorithms and point out some of their weaknesses in the light of inexpensive hardware implementation. Then I describe the proposed algorithm and show how it could be implemented efficiently using only simple general-purpose hardware components. To conclude, I present some results from performance simulations showing, e.g., memory consumption, search structure construction time, and average search times.

2 Existing Algorithms

There are many excellent articles on the classification and efficiency of different address lookup algorithms, such as [1, 3]. In this paper the previous work is not redone but is taken advantage of to find out how suitable these algorithms could be for an FPGA implementation. There are many ways to divide the lookup algorithms into groups, but I have used a quite crude method: they are divided into software and hardware based ones. The reason for this division is that the constraints given by the implementation environment differ considerably in those two groups.

2.1 Software algorithms

A classical way to represent prefixes is a tree-based data structure called a trie, where the bits of the prefixes are used to direct branching. A simple binary trie is straightforward to generate and it allows for easy incremental updates. However, there are some critical problems: the worst-case search time is long (32 steps for 32-bit IPv4 addresses) and the memory efficiency is not very good. The traditional way to overcome these problems has been path compression techniques (PATRICIA and the BSD trie). However, path compression does not guarantee short search times; actually the worst-case search time remains the same or is even doubled if backtracking is used. Furthermore, the memory efficiency of these algorithms decreases as more prefixes are added, i.e. as the trie gets denser. There are two basic ways to achieve better performance: to use either multibit tries or prefix range search. However, the latter alternative does not guarantee short worst-case search times and thus it is not considered further in this paper. A multibit trie operates on multiple bits simultaneously. The set of bits inspected in one step is called a stride (see the sketch below). Depending on the algorithm, all the strides in a trie can have the same size, i.e. a fixed-stride multibit trie, or the strides can have different sizes, i.e. a variable-stride multibit trie. There are trade-offs in selecting a suitable stride size: a large stride gives short search times as the trie has fewer levels, but at the same time there will be lots of empty entries, resulting in a large memory image and harder updates. The existing multibit trie algorithms use several improvements to the basic scheme to achieve better memory efficiency. One method is to use path compression with multibit tries as in the level compressed (LC) trie [4].
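To make the stride terminology concrete, here is a minimal sketch (mine, not the paper's) of how a fixed-stride trie consumes an IPv4 address: with 8-bit strides the address splits into four chunks and each chunk indexes one trie level, which is where the four-step worst case of an 8-bit fixed-stride trie comes from.

    #include <stdint.h>
    #include <stdio.h>

    /* Extract the k-bit chunk used at a given trie level (level 0 uses the
     * most significant bits).  With k = 8 an IPv4 address yields at most
     * four levels and therefore at most four memory references. */
    static unsigned stride_index(uint32_t addr, int level, int k)
    {
        int shift = 32 - (level + 1) * k;
        return (addr >> shift) & ((1u << k) - 1);
    }

    int main(void)
    {
        uint32_t addr = 0xC0A80A01u;              /* 192.168.10.1 */
        for (int level = 0; level < 4; level++)   /* 8-bit strides -> 4 levels */
            printf("level %d index %u\n", level, stride_index(addr, level, 8));
        return 0;
    }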
Another solution is to use compression with fixed-stride tries, e.g. as in the Lulea algorithm [5]. There are also algorithms that rely on determining optimal stride sizes using different optimisation methods. However, these methods usually sacrifice the possibility of incremental updates: the LC-trie is very hard to update, the Lulea trie is impossible to update and due to compression requires excess memory references, and incremental updates degrade the efficiency of the methods based on optimisation [1]. All the high-performance software algorithms are designed to have as small a memory footprint as possible to take full advantage of fast cache memories. The primary motivation for this is the large penalty caused by a cache miss, often dozens of clock cycles. On the other hand, these algorithms can take advantage of the complex operations provided by a general-purpose instruction set. Carefully tuned algorithms can reach quite respectable performance levels: estimates as high as 87 million lookups per second have been reported [6]. However, the efficiency of the most aggressive cache-based algorithms relies on the locality of the traffic, which

cannot be guaranteed in backbone routers. In addition to the problems with incremental updates, the whole class of software-based algorithms has some drawbacks that reduce their attractiveness. First of all, the I/O performance of general-purpose computers simply does not match the requirements of a Gigabit router. On the other hand, embedding a high-performance general-purpose CPU into a custom-built router card is neither simple nor inexpensive due to the required support circuitry. Furthermore, the effectiveness is threatened from two directions: the increasing number of prefixes may result in a trie that is larger than the cache of the target system, and better memory architectures in the future may turn the complex compression methods from an advantage into a burden.

2.2 Hardware algorithms

Using a hardware-based address lookup algorithm has some obvious advantages: the memory architecture can be tailored to the requirements of the algorithm and complex bit manipulation operations can be implemented. Furthermore, explicit concurrency can be taken advantage of, instead of the implicit, CPU-architecture-specific kind. However, the implementor of a hardware-based system cannot rely on an existing apparatus like large on-chip caches or multiple pipelined execution units. Thus it is desirable to keep HW-based algorithms as simple as possible. This is very important especially with FPGA-based implementations because it is quite impossible to implement, e.g., large and fast on-chip caches.

One straightforward implementation alternative is to use a multibit trie with only a couple of levels. The basic scheme uses a two-level trie: a 24-bit stride at the first level and 8-bit strides at the second level [7]. Because the number of prefixes longer than 24 bits is still quite low (1806 in NLANR data from March 2001), most of the lookups are performed in just a single memory reference. The problem with this kind of trie is that in some cases updating it can take a long time, e.g. if a prefix of length 8 changes we must update at least 2^(24-8) = 2^16 entries. Furthermore, the memory requirements, about 33 MB, make it very costly to use the fastest SRAM devices, e.g. it would require 17 x 18 Mbit chips. Using synchronous DRAM devices is not a realistic option, as truly random memory references can take, e.g., 5 cycles to complete, reducing the effective memory cycle speed to something like 30 MHz. A realistic option would be slow, large-capacity SRAM modules, but in that case the memory cycle speed is considerably lower. An alternative HW-based trie algorithm takes advantage of compression methods [8]. However, the ease of incremental updates is in danger again. While the authors of [8] try to belittle this problem by claiming that updates are required only once in every few seconds, this is a real problem as their method seems to make incremental updates impossible. The memory requirement of this scheme is said to be small, but the results were obtained using a random prefix set.

Standard content addressable memories (CAMs) cannot be applied directly to the longest match operation. The reason for this is that the length of the prefix cannot be determined from the IP address without doing the longest match operation. The only direct way to take advantage of the properties of CAMs is to use them in implementing some fixed-structure multibit trie algorithm. In this case the memory consumption can be quite modest even with a simple search structure. However, CAM devices are more expensive, slower, and offer less capacity than ordinary random access memories. Special ternary CAMs, which can store three values in each bit (0, 1, and * = don't care), are suitable for the longest match operation. However, they are even more expensive and have even less capacity than ordinary CAM devices [9, 10].
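Returning to the two-level 24/8 scheme [7] discussed above, the following is a rough software model of it under assumed table layouts (the IS_BLOCK flag, the 16-bit entries and the helper names are my own, not the scheme's actual hardware); it shows both the single-memory-reference common case and why a short prefix change touches 2^16 first-level entries.

    #include <stdint.h>
    #include <stdlib.h>

    /* First-level table: 2^24 entries indexed by the top 24 bits of the address.
     * If the top bit of an entry is 0, the remaining bits are the next hop index;
     * if it is 1, they point to a 256-entry second-level block. */
    #define L1_SIZE   (1u << 24)
    #define IS_BLOCK  0x8000u

    static uint16_t *l1;          /* 2^24 x 16 bit (32 MB)  */
    static uint16_t *l2;          /* blocks of 256 x 16 bit */

    static uint16_t lookup_24_8(uint32_t addr)
    {
        uint16_t e = l1[addr >> 8];                    /* one memory reference    */
        if (!(e & IS_BLOCK))
            return e;                                  /* next hop found directly */
        uint32_t block = e & (IS_BLOCK - 1);
        return l2[block * 256 + (addr & 0xFF)];        /* second reference        */
    }

    /* Inserting a short prefix is the expensive part: a /8 route change touches
     * 2^(24-8) = 65536 first-level entries, as noted in the text. */
    static void set_prefix_24(uint32_t net, int len, uint16_t nexthop)
    {
        uint32_t first = net >> 8;
        uint32_t count = 1u << (24 - len);
        for (uint32_t i = 0; i < count; i++)
            l1[first + i] = nexthop;
    }

    int main(void)
    {
        l1 = calloc(L1_SIZE, sizeof *l1);
        l2 = calloc(256 * 1024, sizeof *l2);           /* room for 1024 blocks */
        if (!l1 || !l2) return 1;
        set_prefix_24(0x8A000000u, 8, 7);              /* 138.0.0.0/8 -> hop 7 */
        return lookup_24_8(0x8A010203u) == 7 ? 0 : 1;  /* 138.1.2.3            */
    }

The 2^24 16-bit first-level entries alone already account for about 32 MB, which is in line with the 33 MB figure quoted above.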
3 Proposed Algorithm

The requirements for an efficient and versatile HW-based address lookup algorithm are:

1. It requires only a small number of memory references per lookup operation.
2. It allows for a parallel and/or pipelined implementation.
3. It can be incrementally generated and updated.
4. It has only a modest memory footprint.

Requirements 1-2 are directly related to performance: external memories, even the fastest ones, cannot match the speed of internal L1 or even L2 caches, and thus each memory access is more expensive. On the other hand, it is quite a challenging task to get an FPGA design

to run at the speeds of the fastest SRAM devices, and therefore it is important that parallelism and pipelining can be used. While it may be true that instant route updates are not a strict requirement for current IP routers, it is not certain that the situation will remain so. Thus it is important to retain the possibility of fast incremental updates. As a HW implementation is not tied to a limited CPU cache, memory consumption is not such an important issue. However, if we would like to take advantage of the latest and fastest memory technology, a realistic memory space today is approximately 4-16 MB.

The proposed algorithm is designed to be implemented using standard FPGA and SRAM devices. Using leading-edge FPGA devices there should not be any difference in performance compared to an ASIC implementation. The reason for this is that the performance depends on the speed of the memory access, which, in turn, depends on the memory device. The reason to use fast SRAM devices instead of some type of high-bandwidth DRAM devices is the efficiency and simplicity of the memory access cycle. Current high-bandwidth DRAM devices have many constraints on memory cycles that have to be considered if good performance is desired. In contrast, devices like ZBT-SRAM or quad data rate SRAM offer totally penalty-free freedom in how the memory accesses are done. However, there is a drawback: SRAM devices have less capacity and they are more expensive than DRAM. This requires careful search structure design to minimise the memory usage without performance penalties.

3.1 Compact Stride Multibit Trie (CS Trie)

The proposed algorithm is basically a fixed-stride multibit trie using 8-bit strides at each level. This results in a 4-level trie with a worst-case search time of 4 steps. However, due to the large number of bits in each stride, which requires large trie nodes, this kind of trie would need a large amount of memory. The CS-trie reduces both memory consumption and search time by using conditional entries and compacted strides. While the generation of the trie becomes more complicated than in the basic scheme, this is compensated for by the fewer memory references required for each update. The only real drawbacks are the larger entries in trie leaves, which reduce the memory savings, and the conditional masking machinery, which makes the FPGA circuitry more complex.

The structure of the CS-trie is such that there is an uncompacted root node containing an 8-bit (256-entry) stride. All other nodes and leaves, i.e. those at levels 2-4 of the trie, are compacted 8-bit strides. Each n-bit node contains 2^n entries with 1-6 fields each, depending on the entry type. The entry types (shown in Figure 1) are the following:

EMPTY: No destination network is defined for this address, resulting in no match. Contains only the type field.

DIRECT: Direct match; contains the type and next hop index (nhi) fields.

EXCEPT: Includes a prefix and a prefix length; if the address matches the prefix, the next hop is defined by pnhi, otherwise by nhi.

LINK: As EXCEPT, but if the address matches the prefix, pointer defines the memory address of the next-level stride and nsize defines the size of that next stride. If the address does not match the prefix, nhi defines the next hop index directly.

Figure 1: Entry types.

Besides the trie there is an additional table, the next hop table, defining the information (port number, MAC address, etc.) required for datagram forwarding. The nhi field is used to address this table. There are two fixed nhi values: 0 defines the local address, i.e. the network processor of the system, and all ones defines an invalid destination. The last value allows for deciding "no match" with EXCEPT and LINK entries.
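A hedged C sketch of these entry types follows; it reflects my reading of the description and of the field widths quoted at the end of Section 3.1, and the bit positions chosen in pack_link are one possible packing, not the one used in the actual hardware.

    #include <stdint.h>
    #include <stdio.h>

    /* Entry type tag: 2 bits in the hardware encoding. */
    enum entry_type { EMPTY = 0, DIRECT = 1, EXCEPT = 2, LINK = 3 };

    /* Reserved next hop index values mentioned in the text (10-bit field). */
    enum { NHI_LOCAL = 0, NHI_INVALID = 0x3FF };

    /* Field widths quoted at the end of Section 3.1:
     *   type 2, nhi 10, pnhi 10, prefix 24, plen 5, pointer 19, nsize 3.
     * The longest entry (LINK: type+nhi+prefix+plen+pointer+nsize) is 63 bits,
     * so every entry is stored in a 64-bit word.  One possible packing: */
    static uint64_t pack_link(uint32_t nhi, uint32_t prefix, uint32_t plen,
                              uint32_t pointer, uint32_t nsize)
    {
        return  (uint64_t)LINK
              | (uint64_t)(nhi     & 0x3FF)    << 2
              | (uint64_t)(prefix  & 0xFFFFFF) << 12
              | (uint64_t)(plen    & 0x1F)     << 36
              | (uint64_t)(pointer & 0x7FFFF)  << 41
              | (uint64_t)(nsize   & 0x7)      << 60;
    }

    int main(void)
    {
        /* A hypothetical LINK entry: no-match next hop invalid, a /17 prefix
         * (only bits 23..0 are stored, the first octet being implicit),
         * child node at block 42, child node size 2^1 = 2 entries. */
        uint64_t e = pack_link(NHI_INVALID, 0x0B8000, 17, 42, 1);
        printf("packed LINK entry: 0x%016llx\n", (unsigned long long)e);
        return 0;
    }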
An important feature of the CS-trie is its compacted strides. The basic idea is to have a trie node that is just large enough to make all the entries in that node separable, e.g. if one third-level leaf contains definitions for only two networks (say two /17 prefixes), we can have a size-2 trie node instead of a full 256-entry node. Having EXCEPT entries and masked LINK entries makes this even more efficient: now we can have very compact strides even if the length of the prefixes would otherwise require large nodes, e.g. if there are three subnetworks within 11.11/16 (a /20, a /26, and a /27), we can still have size-2 nodes at both L3 and L4 (L3[0] = EXCEPT /20, L3[1] = LINK /25, L4[0] = EXCEPT /26, L4[1] = EXCEPT /27).

The trie is traversed using the following algorithm:

    entry <- rootnode[addr[31:24]]
    level <- 1
    while (true)
        if entry.type == EMPTY
            return no_match, 0
        if entry.type == DIRECT
            return match, entry.nhi
        if entry.type == EXCEPT
            if match(entry.prefix, addr)
                return match, entry.pnhi
            else if entry.nhi == all_ones        # invalid destination
                return no_match, 0
            else
                return match, entry.nhi
        if entry.type == LINK
            if match(entry.prefix, addr)
                # index the compacted next-level node by the top nsize bits
                # of the next 8-bit address chunk
                entry <- trie[entry.pointer + (indx(addr, level) >> (8 - entry.nsize))]
                level <- level + 1
            else if entry.nhi == all_ones        # invalid destination
                return no_match, 0
            else
                return match, entry.nhi

The sizes of the fields are the following: the type field is 2 bits (enough to identify the 4 types), both next hop index fields are 10 bits, allowing for 1022 next-hop destinations, the prefix field is 24 bits (the first 8 bits are defined implicitly by the root node), and the prefix length field is 5 bits. The size of the pointer field has been chosen to be 19 bits, which allows us to address 8 MB of memory in 128-bit blocks. The size of the next size field is determined by the maximum node size (256 entries): as node sizes are always powers of 2, 3 bits are enough. This means that the length of the longest entry (LINK) is 63 bits and thus 64 bits have to be used for each entry.
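The traversal can be mirrored in software. The following C model is my interpretation of the pseudocode above; the prefix-match helper, the compacted-stride indexing by the top nsize bits of the next 8-bit chunk, and the toy trie in main are assumptions, not the FPGA implementation itself.

    #include <stdint.h>

    enum entry_type { EMPTY, DIRECT, EXCEPT, LINK };
    enum { NHI_INVALID = 0x3FF };            /* all-ones 10-bit next hop index */

    struct cs_entry {
        enum entry_type type;
        uint32_t nhi, pnhi;     /* next hop indices (10 bits each)              */
        uint32_t prefix;        /* prefix bits 23..0 (the first 8 are implicit) */
        uint32_t plen;          /* prefix length, assumed 8..32 here            */
        uint32_t pointer;       /* index of the next-level node in 'trie'       */
        uint32_t nsize;         /* log2 of the next node's size (0..8)          */
    };

    /* Do the first plen bits of addr match the entry's prefix?  Only address
     * bits 23..0 are compared; the root already matched bits 31..24. */
    static int prefix_match(const struct cs_entry *e, uint32_t addr)
    {
        uint32_t mask = (0xFFFFFFFFu << (32 - e->plen)) & 0x00FFFFFFu;
        return ((addr ^ e->prefix) & mask) == 0;
    }

    /* The 8-bit chunk inspected after the current level (level 1 = root). */
    static unsigned chunk(uint32_t addr, int level)
    {
        return (addr >> (32 - 8 * (level + 1))) & 0xFF;
    }

    /* Returns the next hop index, or -1 for no match. */
    static int cs_lookup(const struct cs_entry *root, const struct cs_entry *trie,
                         uint32_t addr)
    {
        struct cs_entry e = root[addr >> 24];
        int level = 1;
        for (;;) {
            switch (e.type) {
            case EMPTY:  return -1;
            case DIRECT: return (int)e.nhi;
            case EXCEPT:
                if (prefix_match(&e, addr)) return (int)e.pnhi;
                return e.nhi == NHI_INVALID ? -1 : (int)e.nhi;
            case LINK:
                if (!prefix_match(&e, addr))
                    return e.nhi == NHI_INVALID ? -1 : (int)e.nhi;
                /* Compacted stride: a 2^nsize node is indexed by the top
                 * nsize bits of the next 8-bit address chunk. */
                e = trie[e.pointer + (chunk(addr, level) >> (8 - e.nsize))];
                level++;
                break;
            default: return -1;
            }
        }
    }

    int main(void)
    {
        /* Toy trie: the root entry for 138.x.x.x links to a compacted size-2
         * second-level node; all other values are made up for the example. */
        static struct cs_entry root[256];
        static struct cs_entry l2[2];
        root[138] = (struct cs_entry){ .type = LINK, .nhi = NHI_INVALID,
                                       .prefix = 0, .plen = 8,
                                       .pointer = 0, .nsize = 1 };
        l2[0] = (struct cs_entry){ .type = DIRECT, .nhi = 4 };   /* 138.0/9   */
        l2[1] = (struct cs_entry){ .type = DIRECT, .nhi = 5 };   /* 138.128/9 */
        return cs_lookup(root, l2, 0x8A800001u) == 5 ? 0 : 1;    /* 138.128.0.1 */
    }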

3.2 On-the-fly Updates

The CS-trie does not rely on aggressive compression methods nor on a rigid calculation of the next entry address. This gives us an opportunity to use simple incremental updates, just as with the PATRICIA trie. A hardware-based CS-trie has two separate data structures: one for generating the trie and one for performing address lookups (Figure 2). The former data structure is used by a network processor responsible for processing routing information messages and calculating forwarding table updates. The latter one is located inside the forwarding engine. This allows for storing extra information in the former data structure to make it easier to calculate the updates.

Figure 2: Processing routing information and updating the forwarding table.

The basic idea with on-the-fly updates is to use a stack for insertions and deletions. When a new prefix is added into the generating trie at the network processor, the changes are pushed onto a stack. After the insertion is complete, the changes are popped from the stack and updated into the lookup trie. In this way the update caused by an insertion is done from the leaves to the root while the actual calculation advances from the root to the leaves. This procedure ensures that a concurrent lookup operation will never run into an incomplete subtree. A prefix deletion proceeds from the leaves to the root and the changes are again first pushed onto a stack and then popped and updated into the lookup trie. In this way the deletion updates proceed in the opposite direction and, again, a concurrent lookup is not interfered with.

One very important feature enabling low-workload incremental updates is dynamic memory management. As nodes can be added, deleted, and resized on the fly, a monolithic single-table trie cannot be used. Instead, the memory is divided into pages large enough to contain one full-sized node (256 entries, i.e. 2 kB), and each page can be divided further into subpages depending on the node size. To make the memory management a bit easier, one page is always divided into same-sized subpages. There are list structures in the network processor to manage, i.e. allocate and deallocate, free pages and subpages. The dynamic memory management guarantees that changes in the trie structure have only local impacts.

3.3 Pipelined Implementation

The pipelined implementation (Figure 3) takes advantage of the small internal memory blocks provided by FPGA devices. The L1 node (root node) and the next hop table are placed into these memory blocks instead of the external memory. This enables concurrent memory accesses to these tables and to the external memory containing the rest of the trie (L2-L4). This reduces the number of required external memory accesses by two per address match operation. It should be noted that the possibility of using the internal memory was the main reason to use an 8-bit root stride: a larger stride, e.g. a 16-bit one, would be far too large to fit in.

Figure 3: HW architecture of the proposed implementation of the CS-trie based address match system.

The operation of the address match system is the following:

1. The first 8 bits of the IP address are used to access the L1 table; the L1-match unit does the first round of the trie traversal and forwards the result (match + nhi, or pointer + address) to one of the next two units.

2. The current pointer is used to address the external memory; the L2-4-match unit performs the corresponding round of the trie traversal and either updates the pointer (only with a LINK entry) or forwards the result to the next stage.

3. The result dispatcher gets results from the L1- and L2-4-match units and accesses the next hop table according to the results.

4. The memory arbitrator gets addresses from the match units and trie updates from the control unit (not shown). It schedules the memory cycles for the requests from the different units and hides memory latencies.

All these tasks are performed concurrently. If the match units cannot perform entry processing at the rate of the memory cycles, they can be pipelined too. In Figure 4 one possible parallel and pipelined architecture is shown. At the first stage the address is XORed with the prefix, a mask is created according to the prefix length, and the index of the entry at the next-level stride is calculated (of course these values have no meaning for DIRECT or EMPTY entries). At the second stage the results of the XOR operation together with the mask are used to check for a prefix match, and the index and pointer are used to calculate the address of the next-level entry. At the last stage the value of the entry type field is used to select the results to be used for the decision of the correct outcome.

Figure 4: An example of a highly parallel pipelined implementation of the match unit.
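To restate the update ordering of Section 3.2 in code form, here is a minimal sketch with invented names (trie_change, push_change, commit_changes): changes computed while walking the generating trie from the root towards the leaves are pushed onto a stack and then popped into the lookup trie, so a new leaf is always written before the LINK entry that will point to it.

    #include <stdint.h>
    #include <stddef.h>

    /* One pending write to the lookup trie in the forwarding engine:
     * 64-bit entry 'value' goes to trie word 'index'. */
    struct trie_change { uint32_t index; uint64_t value; };

    #define MAX_CHANGES 1024
    static struct trie_change stack[MAX_CHANGES];
    static size_t top;

    /* Called by the network processor while it walks the generating trie from
     * the root towards the leaves, computing the new entries for an insertion. */
    static void push_change(uint32_t index, uint64_t value)
    {
        if (top < MAX_CHANGES)
            stack[top++] = (struct trie_change){ index, value };
    }

    /* Called once the whole insertion has been computed: the changes are popped
     * and written to the lookup trie leaves-first, so a concurrent lookup never
     * follows a pointer into an incomplete subtree.  For a deletion the changes
     * are computed leaves-first and therefore get applied root-first, which is
     * again safe for concurrent lookups. */
    static void commit_changes(volatile uint64_t *lookup_trie)
    {
        while (top > 0) {
            struct trie_change c = stack[--top];
            lookup_trie[c.index] = c.value;
        }
    }

    int main(void)
    {
        static uint64_t lookup_trie[4096];
        push_change(100, 0x1);    /* new LINK in the parent node (computed first) */
        push_change(2048, 0x2);   /* new leaf entry (computed last)               */
        commit_changes(lookup_trie);   /* the leaf at 2048 is written before 100  */
        return lookup_trie[2048] == 0x2 && lookup_trie[100] == 0x1 ? 0 : 1;
    }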

3.4 Performance estimates

To get some realistic performance estimates, a test trie was generated using a real-world routing table, and then a set of address matches was carried out. The routing table used in these tests was obtained from the NLANR Measurement and Network Analysis Group web site and is dated March 2001. The resulting trie required approximately 4.3 MB of 2 kB pages. The compacted strides saved a considerable amount of memory (57%), as there were 5091 nodes in the trie. It took 530 ms to generate the trie on a 480 MHz Sun UltraSPARC-II, about 5.1 µs per entry. However, I believe that the generation time could be made substantially shorter if the trie generation were carefully optimised. Compared to the LC-trie [4] the memory consumption was quite competitive: an LC-trie generated from the same data required 2.5 MB (trie nodes at 4 bytes each, base vector 96712 x 16 bytes, prefix vector 7803 x 12 bytes). The time to build the LC-trie was 490 ms. I was unable to test the CS-trie with the same data as was used in [4]; the data provided by Nilsson contained invalid addresses (I am afraid that these errors may affect the results reported in [4]). I think that these results are encouraging, as the CS-trie is not designed for minimum memory consumption, and the incremental CS-trie construction program already takes into account the time required to synchronise the memory contents between the network processor and the forwarding engine.

For the longest match performance estimates, a large set (10^8) of uniformly distributed random IP addresses was generated. The reason for this approach was simply the lack of a suitable traffic trace. The estimates are shown in Table 1.

Table 1: The average level of the trie where a match is found, for different prefix lengths and for the no-match case. (Table values not preserved.)

It can be noted that declaring a no-match takes a very short time. However, this result may be misleading, as it is unlikely that the addresses of non-existing destinations are uniformly distributed. More results for different routing table sizes are shown in Table 2.

Table 2: Trie size and generation time with different routing table sizes. The routing information is obtained from NLANR and is dated November 8, except for year 2001, which is from March 16. (Columns: year; routing entries; trie size in MB; memory per entry in B; prefixes of length > 24; L4 strides; total generation time in ms; generation time per entry in µs. Table values not preserved.)

It seems that the memory requirements and the construction complexity grow in a linear manner. Furthermore, it can be noted that the number of L4 leaves remains minimal while the number of long prefixes has grown tenfold over the years.

Updating the search structure at the forwarding engine by copying it from the network processor does not take too long, nor does it reserve a high portion of the memory cycles. Let's consider a simple situation where the forwarding engine and the network processor are interconnected by a 33 MHz, 32-bit PCI bus. If we could use the bus with an efficiency of 60%, a total rewrite of, e.g., a 4 MB search structure takes only 53 ms. Furthermore, at the same time only 7.4% of the memory cycles at the forwarding engine are required for the update, if 133 MHz, 64-bit ZBT-SRAM devices are used. Thus, any forwarding table updates are carried out in a short time and without noticeable impact on the forwarding performance.

3.5 IP version 6

At this point one may wonder how the proposed table lookup algorithm could be upgraded to support long IPv6 addresses. However, the whole question is more or less absurd: IPv6 addressing is hierarchical [11]. One of the key ideas of adopting the 128-bit address format was not only to guarantee an address space that is more than adequate but also to get rid of the cumbersome CIDR addressing. This means that with IPv6 there is no need for an algorithm that performs well in longest match operations. One exception is the IPv4-compatible addressing mode, but then the addresses are 32-bit and thus they can be handled with a standard IPv4 address lookup. I think that the best way to upgrade an existing router architecture to support IPv6 is to add a separate IPv6 module. This module can take advantage of efficient hierarchical search methods. Furthermore, the search structures are likely to be quite small, as the fine-grained network topology can be efficiently hidden. In other words, prefix aggregation should really work with IPv6 addressing.

4 Future work

In the future our team is planning to create a detailed VHDL description of the pipelined forwarding unit to be able to simulate its performance. By using exact timing information fed back from the place-and-route process, quite accurate estimates can be obtained. However, based on our previous experience, I am quite sure that the implementation can run at the required clock rates in our target system (a 6 Mgate Xilinx Virtex-II). Our final goal is to include this design in our distributed router system, which is also under development.

One obvious target for improvement in the CS-trie is the memory consumption: the size of an entry is defined by the size of the longest entry type. However, over 90% of the entries are of either EMPTY or DIRECT type. If some kind of split memory space or dual memory scheme could be used, considerable amounts of memory could be saved. Quick approximate estimates show that the EMPTY and DIRECT entries could be stored in 16 bits instead of 64 bits.
In this way the size of the memory required by the trie could be reduced to about one third of the original, e.g. from 4.3 MB to 1.4 MB. However, this requires further study; it must be made sure that the performance does not degrade and that the basic structure does not become too complex.
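The transfer-time figures of Section 3.4 and the one-third memory estimate above can be reproduced with simple arithmetic; the sketch below just restates the stated assumptions (33 MHz, 32-bit PCI at 60% efficiency; 133 MHz, 64-bit ZBT-SRAM; over 90% of entries shrinking from 64 to 16 bits).

    #include <stdio.h>

    int main(void)
    {
        /* Copying a 4 MB search structure over a 33 MHz, 32-bit PCI bus
         * used at 60% efficiency (figures from Section 3.4). */
        double bytes     = 4.0 * 1024 * 1024;
        double pci_rate  = 33e6 * 4 * 0.60;          /* bytes per second */
        double copy_time = bytes / pci_rate;         /* ~0.053 s = 53 ms */

        /* Fraction of 133 MHz memory cycles spent on the 64-bit writes
         * at the forwarding engine during that copy. */
        double writes    = bytes / 8;                /* 64-bit words     */
        double cycles    = 133e6 * copy_time;
        double fraction  = writes / cycles;          /* ~0.074 = 7.4%    */

        /* Memory saving if about 90% of entries (EMPTY/DIRECT) use 16 bits
         * instead of 64: the average entry shrinks to roughly one third. */
        double avg_bits  = 0.9 * 16 + 0.1 * 64;      /* 20.8 bits        */
        double ratio     = avg_bits / 64.0;          /* ~0.33            */

        printf("copy time  %.1f ms\n", copy_time * 1e3);
        printf("cycle load %.1f %%\n", fraction * 100);
        printf("size ratio %.2f (4.3 MB -> %.1f MB)\n", ratio, 4.3 * ratio);
        return 0;
    }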

There are also other possibilities for improving the CS-trie, such as using arbitrary bit masks with LINKs and adding extra prefixes to EXCEPT entries at the lowest level. However, it is unclear whether these modifications would have any measurable impact on performance, and thus further studies are again required.

5 Conclusions

A novel address lookup method, the CS-trie, was described and its efficiency was demonstrated. I have shown that it is possible to have a highly efficient HW-based trie without sacrificing the possibility of incremental updates. Furthermore, it was shown that by using a few simple features, i.e. EXCEPT entries and compacted strides, it is possible to obtain large memory savings as well as improved lookup performance. I have also introduced the principle of two tries (a generating trie and a lookup trie), which makes it much easier to calculate incremental updates. What is more, the results show that the memory consumption and generation time of the CS-trie grow in a linear manner. An example of an inexpensive hardware implementation was also given. A high-performance CS-trie based system can be realised using standard FPGA and SRAM devices. Such a system could easily perform approximately 60-95 million address lookup operations per second, which is more than adequate for current bit rates (60 Mlookups/s with 40-byte packets = 19.2 Gbit/s and 95 Mlookups/s with 250-byte packets = 190 Gbit/s).

Acknowledgements

I would like to thank the National Science Foundation (Cooperative Agreement No. ANI), the National Laboratory for Applied Network Research, and its Measurement and Network Analysis Group for kindly providing public access to the routing information used in this work.

References

[1] Miguel Á. Ruiz-Sánchez, Ernst W. Biersack, and Walid Dabbous. Survey and taxonomy of IP address lookup algorithms. IEEE Network, 15(2):8-23, 2001.

[2] Christian Huitema. Routing in the Internet. Prentice Hall, 2nd edition.

[3] Henry Hong-Yi Tzeng and Tony Przygienda. On fast address-lookup algorithms. IEEE Journal on Selected Areas in Communications, 17(6), 1999.

[4] Stefan Nilsson and Gunnar Karlsson. IP-address lookup using LC-tries. IEEE Journal on Selected Areas in Communications, 17(6), 1999.

[5] Mikael Degermark, Andrej Brodnik, Svante Carlsson, and Stephen Pink. Small forwarding tables for fast routing lookups. In Proceedings of ACM SIGCOMM '97, Cannes, France, 1997.

[6] Tzi-cker Chiueh and Prashant Pradhan. High-performance IP routing table lookup using CPU caching. In Proceedings of IEEE INFOCOM 1999, volume 3, 1999.

[7] P. Gupta, S. Lin, and N. McKeown. Routing lookups in hardware at memory access speeds. In Proceedings of IEEE INFOCOM 1998, 1998.

[8] Nen-Fu Huang and Shi-Ming Zhao. A novel IP-routing lookup scheme and hardware architecture for multigigabit switching routers. IEEE Journal on Selected Areas in Communications, 17(6), 1999.

[9] Marcel Waldvogel, George Varghese, Jon Turner, and Bernhard Plattner. Scalable high speed IP routing lookups. In Proceedings of ACM SIGCOMM '97, 1997.

[10] Zhongchao Yu, Jianping Wu, Ke Xu, and Mingwei Xu. A fast IP classification algorithm applying to multiple fields. In Proceedings of IEEE ICC 2001, 2001.

[11] Steve King, Ruth Fax, Dimitry Haskin, Wenken Ling, Tom Meehan, Robert Fink, and Charles E. Perkins. The case for IPv6. Internet-Draft draft-ietf-iab-case-for-ipv6-06.txt.


More information

Tree-Based Minimization of TCAM Entries for Packet Classification

Tree-Based Minimization of TCAM Entries for Packet Classification Tree-Based Minimization of TCAM Entries for Packet Classification YanSunandMinSikKim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington 99164-2752, U.S.A.

More information

EECS 122: Introduction to Computer Networks Switch and Router Architectures. Today s Lecture

EECS 122: Introduction to Computer Networks Switch and Router Architectures. Today s Lecture EECS : Introduction to Computer Networks Switch and Router Architectures Computer Science Division Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley,

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience

Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience H. Krupnova CMG/FMVG, ST Microelectronics Grenoble, France Helena.Krupnova@st.com Abstract Today, having a fast hardware

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES

PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES Greg Hankins APRICOT 2012 2012 Brocade Communications Systems, Inc. 2012/02/28 Lookup Capacity and Forwarding

More information

ADDRESS LOOKUP SOLUTIONS FOR GIGABIT SWITCH/ROUTER

ADDRESS LOOKUP SOLUTIONS FOR GIGABIT SWITCH/ROUTER ADDRESS LOOKUP SOLUTIONS FOR GIGABIT SWITCH/ROUTER E. Filippi, V. Innocenti and V. Vercellone CSELT (Centro Studi e Laboratori Telecomunicazioni) Via Reiss Romoli 274 Torino, 10148 ITALY ABSTRACT The Internet

More information

Inter-networking. Problem. 3&4-Internetworking.key - September 20, LAN s are great but. We want to connect them together. ...

Inter-networking. Problem. 3&4-Internetworking.key - September 20, LAN s are great but. We want to connect them together. ... 1 Inter-networking COS 460 & 540 2 Problem 3 LAN s are great but We want to connect them together...across the world Inter-networking 4 Internet Protocol (IP) Routing The Internet Multicast* Multi-protocol

More information

Homework 1 Solutions:

Homework 1 Solutions: Homework 1 Solutions: If we expand the square in the statistic, we get three terms that have to be summed for each i: (ExpectedFrequency[i]), (2ObservedFrequency[i]) and (ObservedFrequency[i])2 / Expected

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Introduction. Router Architectures. Introduction. Introduction. Recent advances in routing architecture including

Introduction. Router Architectures. Introduction. Introduction. Recent advances in routing architecture including Introduction Router Architectures Recent advances in routing architecture including specialized hardware switching fabrics efficient and faster lookup algorithms have created routers that are capable of

More information

Master Course Computer Networks IN2097

Master Course Computer Networks IN2097 Chair for Network Architectures and Services Prof. Carle Department for Computer Science TU München Chair for Network Architectures and Services Prof. Carle Department for Computer Science TU München Master

More information

Master Course Computer Networks IN2097

Master Course Computer Networks IN2097 Chair for Network Architectures and Services Prof. Carle Department for Computer Science TU München Master Course Computer Networks IN2097 Prof. Dr.-Ing. Georg Carle Christian Grothoff, Ph.D. Chair for

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Scalable High-Speed Prefix Matching

Scalable High-Speed Prefix Matching Scalable High-Speed Prefix Matching Marcel Waldvogel Washington University in St. Louis and George Varghese University of California, San Diego and Jon Turner Washington University in St. Louis and Bernhard

More information

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11 DATABASE PERFORMANCE AND INDEXES CS121: Relational Databases Fall 2017 Lecture 11 Database Performance 2 Many situations where query performance needs to be improved e.g. as data size grows, query performance

More information

Fast and Scalable IP Address Lookup with Time Complexity of Log m Log m (n)

Fast and Scalable IP Address Lookup with Time Complexity of Log m Log m (n) 58 JOURNAL OF ADVANCES IN INFORMATION TECHNOLOGY, VOL. 5, NO. 2, MAY 214 Fast and Scalable IP Address Lookup with Time Complexity of Log m Log m (n) Abhishant Prakash Motilal Nehru National Institute of

More information