Fast IP Routing Lookup with Configurable Processor and Compressed Routing Table

H. Michael Ji and Ranga Srinivasan
Tensilica, Inc., 3255-6 Scott Blvd, Santa Clara, CA 95054

Abstract--In this paper we examine a primitive data structure for routing lookup called 24/8, which requires about 32 MB of memory to store the routing table information. A novel compression algorithm reduces the memory requirement to about 3 MB. The data structure is common to both route lookup and update. We present a fast route lookup algorithm and an efficient update algorithm that supports incremental route update. A configurable processor is used to achieve fast IP route lookup. By configuring the processor properly and developing a few customized instructions specifically for route lookup, we can achieve up to 66 million lookups per second (MLPS) with the processor running at 200 MHz.

I. INTRODUCTION

Over the last decade the Internet has grown substantially, both in traffic volume and in the number of routers and hosts attached to the network. One of the major functions of an IP router is packet forwarding: looking up the routing table based on the IP destination field in the packet header to identify the next hop to which the incoming packet should be sent.

Three approaches have primarily been used for IP route lookup: pure software, pure hardware, and a combination of the two. A software approach is taken in [1], which reports 2 million lookups per second (MLPS) on a 233 MHz Pentium II with a 16 KB L1 data cache and a 1 MB L2 cache; it requires 120 CPU cycles per lookup with a three-level (16/8/8) trie data structure. Another software approach is taken in [2], which compresses the routing table into a small forwarding table that fits in the cache memory of an ordinary PC; it requires about 100 instructions per lookup and is claimed to perform 4 MLPS on a 200 MHz Pentium. The hardware approach has been taken by many IP router vendors. For example, Juniper Networks designed an ASIC called the Internet Processor, a centralized forwarding engine with a capacity of 40 MLPS built from more than one million gates. The network processor approach has recently become popular. For example, the np3400 from MMC Networks supports 6.6 million packets per second (MPPS) with a 200 MHz processor optimized for packet processing, and Intel's IXP1200 network processor uses one StrongARM microprocessor with 6 independent 32-bit RISC microengines to forward 3 MPPS. In this paper, using a configurable processor at 200 MHz, we achieve about 33 MLPS by configuring the processor properly and adding a few customized instructions optimized for the IP packet lookup application.

Routing table lookup requires longest prefix matching, which is a much harder problem than exact matching. The most popular data structures for longest prefix matching are the Patricia trie and the level-compressed trie, which are essentially binary trees with compressed levels [11]. A similar scheme called the reduced radix tree has been implemented in Berkeley UNIX 4.3 [9]. Content Addressable Memory (CAM) has been used for route lookup, but it supports only fixed-length patterns and small routing tables [6]. A technique of expanded tries using controlled prefix expansion is introduced in [10] for fast route lookup. In [4], a bitmap is used to compress the routing table so that it fits into a small SRAM and achieves fast lookup speed.
In order to add a new route into the table, the update algorithm in [4] requires sorting and preprocessing all the existing routes together with the new route, which is computationally expensive. In other words, the algorithm in [4] does not support incremental route update. A large DRAM is used in [3] to store a two-level routing table: the most significant 24 bits of the IP destination address index the first-level table, while the remaining 8 bits are used as an offset into the second-level table. This is the so-called 24/8 scheme. The data structure requires 32 MB of memory for the first-level table and much less for the second level.

In this paper, we design a data structure similar to the 24/8 scheme of [3]. The difference is that we use a common data structure for both lookup and update [5]. Furthermore, we compress the 24/8 data structure into a scheme called 24/8c that requires only about 3 MB. The main contributions of this paper are: (1) a more compact data structure that is common to route lookup and update; (2) a fast IP route lookup algorithm that achieves 33 MLPS using a configurable processor with a proper configuration and a few customized instructions; (3) a novel route update algorithm that supports incremental update.

The rest of this paper is organized as follows. First we analyze routing table traces from backbone routers in Section 2. This analysis paves the way for designing an appropriate data structure for the routing information base in Section 3. In Section 4 we present fast IP route lookup and update algorithms. In Section 5, we develop a few customized instructions to accelerate route lookup. Section 6 concludes the paper with a perspective and summary.

II. ROUTING TABLE TRACE ANALYSIS

The routers in the Internet are organized in a loosely hierarchical fashion. Most of the backbone routers are operated and owned by major service providers. These routers have default-free routing tables, i.e., they are expected to recognize all incoming packets with arbitrary IP destination addresses, so they do not need a default route for incoming data packets. Typically there are about 100,000 entries in a backbone router's table, and this number continues to increase as more hosts and routers are deployed. Enterprise routers, used by campuses or organizations, have fewer entries (about 1,000), although some enterprise routers for large organizations may have large routing tables and look like backbone routers.

A routing table entry stores IP address prefixes only (CIDR) [8]. Since each next hop is connected to one of the egress line cards, we can use the egress (output) port number to represent the next hop. So a routing table entry has the format (IP address, mask or prefix length, output port number).

We use the routing tables from 5 major backbone routers: Mae-East, Mae-West, AADS, PacBell, and Paix. These routing tables are made available by the Internet Performance Measurement and Analysis Project [7]. In Table 1, we give the number of routes per prefix length range collected on 10/03/2000 at these major backbone routers' network access points (NAPs). From this table, we make the following observations: there are no routes with prefix length less than 8 in these default-free backbone routers; more than 50% of the routes have prefix length 24; most routes (more than 99%) have prefix lengths from 16 to 24; fewer than 100 routes (less than 0.3%) have prefix length greater than 24; most routes are in Class C; and the number of next hops at these backbone routers is less than 100.

TABLE 1: NUMBER OF ROUTES FOR PREFIX LENGTH INTERVALS

Prefix Range   Mae-East   Mae-West    AADS   PacBell    Paix
0--7                  0          0       0         0       0
8--15               148        245     155       195     592
16--23            11750      14712   13135     15980   35974
24                11961      17084   15379     20312   50460
25--32               64         55      62        71      76
Total             23923      32096   28731     36558   87102

III. DATA STRUCTURE OF 24/8 AND 24/8C SCHEMES

A data structure called 24/8 was defined in [3][5], where the first 24 bits of the IP destination address are used as an index into the first-level table and the remaining 8 bits as an index into the second-level table. The 24/8 data structure requires about 32 MB of memory. First we define a data structure similar to [3]; the major difference is that we do not need a separate data structure for route update. In other words, our data structure is common to route lookup and update, which also means we need not store a separate update routing table. Then we design a compressed 24/8 (called 24/8c) data structure that reduces the memory requirement to about 3 MB.

A. Data Structure of 24/8 Scheme

For an IPv4 packet with a 32-bit IP address, the most significant 24 bits are grouped together and called the segment, and the remaining 8 bits the offset [3][4]. We create two levels of tables to store the routing information base (RIB): T1_RIB and T2_RIB. We use the most significant 24 bits of the IP destination address as an index into T1_RIB. The index into T1_RIB runs from 0.0.0 (the first entry) to 255.255.255 (the last entry), so T1_RIB has 2^24 entries. Each entry in T1_RIB is 2 bytes.
So the total size of T1_RIB is 2^24 * 2 bytes = 32 MB. An entry in T1_RIB stores next hop and prefix length (NHPL) information if no route whose prefix matches the index of that entry has a prefix length greater than 24. If one or more routes associated with the entry have prefix lengths greater than 24, the entry instead stores a base address pointing into T2_RIB, which holds 256 entries per base address; entries in T1_RIB that store base addresses use distinct base addresses. We use the remaining 8 bits of the IP destination address as an offset to a particular entry of T2_RIB. Each entry in T2_RIB is 2 bytes storing the next hop and prefix length (NHPL) information [5]. Since the number of routes with prefix length greater than 24 is less than 100, as shown in Table 1, the size of T2_RIB is less than 100 * 256 * 2 bytes = 50 KB. So we really need to compress the first-level table in order to minimize the memory requirement (to be shown later).

For each route, we store the next hop and the prefix length. The prefix length of each route entry is needed for route update; more discussion on how to update the routing table follows shortly. For T1_RIB entries, the bit fields are used as follows. The most significant bit, NHPL[15], is a marker bit. If NHPL[15] is 0, NHPL[14:6] stores the next hop and NHPL[5:0] stores the prefix length of the route associated with that entry. Otherwise, NHPL[14:0] stores an index into table T2_RIB; these 15 bits cover the range from 0 to 32,767, which is more than sufficient for indexing into the second-level table. For each entry in T2_RIB, the first 10 bits store the next hop and the remaining 6 bits store the prefix length associated with the entry.
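For concreteness, the following C helpers transcribe the NHPL bit fields just described. This is a minimal sketch for illustration; the names are ours, not the paper's.

    #include <stdint.h>

    /* 16-bit NHPL layout for T1_RIB entries, as described above:
     *   bit 15     marker: 0 = next hop + prefix length, 1 = T2_RIB index
     *   bits 14:6  next hop          (when marker = 0)
     *   bits 5:0   prefix length     (when marker = 0)
     *   bits 14:0  index into T2_RIB (when marker = 1)
     * T2_RIB entries use bits 15:6 for the next hop, 5:0 for the length. */
    typedef uint16_t nhpl_t;

    static inline nhpl_t nhpl_make_leaf(uint16_t hop, uint8_t plen) {
        return (nhpl_t)(((hop & 0x1FFu) << 6) | (plen & 0x3Fu));
    }
    static inline nhpl_t nhpl_make_index(uint16_t t2_index) {
        return (nhpl_t)(0x8000u | (t2_index & 0x7FFFu));
    }
    static inline int      nhpl_is_index(nhpl_t e)  { return (e >> 15) & 1; }
    static inline uint16_t nhpl_next_hop(nhpl_t e)  { return (e >> 6) & 0x1FF; }
    static inline uint8_t  nhpl_plen(nhpl_t e)      { return (uint8_t)(e & 0x3F); }
    static inline uint16_t nhpl_t2_index(nhpl_t e)  { return e & 0x7FFF; }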

We notice that the first-level table T1_RIB stores redundant information. For example, suppose there is a route (128.3.0.0/16/1, in the format prefix/prefix length/next hop) and no other route begins with 128.3. Then all 256 entries (from 128.3.0 to 128.3.255) in T1_RIB store the same information.

We analyzed the T1_RIB entries built from various backbone routing tables. We divide T1_RIB into blocks of 2^6 = 64 entries, giving 2^18 blocks for the whole T1_RIB table. (The block size could be chosen differently; we use 64 because the resulting 12-byte compressed entry fits within a 128-bit processor interface width.) Specifically, we group the entries from 0.0.0 to 0.0.63 as block 1, 0.0.64 to 0.0.127 as block 2, ..., and 255.255.192 to 255.255.255 as block 262,144. For each 64-entry block we define a counter. We initially set the counter to 1 and then scan one entry at a time, from the 2nd entry to the 64th entry in the block; the counter is incremented by 1 each time the current entry NHPL[15:0] differs from its previous entry. Note that if the marker bit of an entry is 1, that entry necessarily differs from both its previous and following entries, since T1_RIB uses distinct indexes into the second-level tables. We call this counter the dimension of the NHPL array, dim(NHPL), for that block. Analyzing the backbone routing table traces, we found that more than 98% of the blocks have dim(NHPL) equal to 1 or 2. The maximum dim(NHPL) ranges from 33 to 44, while the average is between 1.08 and 1.14. This shows there is a lot of redundancy in T1_RIB. In the next subsection, we design a data structure that uses a bitmap to compress T1_RIB to about 3 MB.

B. Data Structure of 24/8c Scheme

We design a novel scheme to compress the routing table, called the 24/8c scheme. In the 24/8c data structure, each entry t1_entry in the first-level table T1_RIB has 12 bytes (96 bits). The fields of a T1_RIB entry are used as follows:

1. t1_entry[95:32]: a 64-bit bitmap. The most significant bit t1_entry[95] is always set to 1. For a bit at position K, the number of 1's from the most significant bit down to and including position K gives the index into the NHPL array, which stores next hop/index and prefix length information.

2. t1_entry[31:0]: 1 to 2 NHPLs, or a 32-bit address. Let all_ones be the total number of 1's in the bitmap t1_entry[95:32]. If all_ones is 1, t1_entry[31:16] stores NHPL[1]. If all_ones is 2, t1_entry[31:16] stores NHPL[1] and t1_entry[15:0] stores NHPL[2]. Otherwise (all_ones > 2), t1_entry[31:0] stores a 32-bit address pointing to where the NHPL array is stored (i.e., t1_entry[31:0] = &NHPL[1]).

An extended T1_RIB table is used for those entries of T1_RIB whose dim(NHPL) is more than 2 (fewer than 2% of blocks). Each entry in the extended T1_RIB is 2 bytes storing next hop/index and prefix length information. In this compressed data structure, NHPL[i] always differs from its neighbors NHPL[i-1] and NHPL[i+1]. From analyzing the routing traces of those backbone routers, we observe that the total size of the extended T1_RIB tables is less than 80 KB.

We use the most significant 18 bits of the IP destination address as an index into T1_RIB. The first-level table T1_RIB has 2^18 entries of 12 bytes each, so the T1_RIB table is 3 MB in total. The following 6 bits of the IP destination address select a bit position in the bitmap t1_entry[95:32]. For example, given the IP address 128.3.255.0, the first 18 bits are 128.3.3 (the 16 bits of 128.3 followed by the two leading 1 bits of 255).
So it indexes into the entry t1_entry = T1_RIB[128.3.3]. The following 6 bits are 6'b111111, which maps to bitmap position t1_entry[32]. The second-level table T2_RIB in the 24/8c scheme is the same as in the 24/8 scheme. If the size of T2_RIB grows in the future, a bitmap can be used in similar fashion to compress it too. In the next section, we design a route update algorithm that creates the bitmap and NHPL array for the 24/8c scheme directly, without first creating the T1_RIB of the 24/8 scheme and then analyzing it.

IV. ROUTING LOOKUP AND UPDATE ALGORITHM

A. Routing Lookup Algorithm

Upon receiving an IP data packet at the ingress line card, the following steps perform the routing table lookup:

1. Extract the 32-bit destination IP address ip_addr[31:0] from the packet header. Divide it into a segment and two offsets: segment[17:0] = ip_addr[31:14], offset1[5:0] = ip_addr[13:8], offset2[7:0] = ip_addr[7:0].

2. Use segment[17:0] as an index into T1_RIB. A single cache read (on a cache hit) or memory read (on a miss) yields the 12-byte result result[95:0] = T1_RIB[segment].

3. Compute the total number of 1's in the bitmap result[95:32], say all_ones. If all_ones <= 2, result[31:0] gives the NHPL directly. Otherwise, result[31:0] is an address pointing to where the NHPL array is stored. Compute the bit position K = 95 - offset1 and the number of 1's in result[95:K], say leading_ones. Retrieve the next hop/index and prefix length information NHPL[leading_ones].

4. If the marker bit NHPL[15] is 0, indicating that we do not need to access the second-level table T2_RIB, the next hop is given by NHPL[14:6].

5. Otherwise, compute the index into the second-level table by multiplying NHPL[14:0] by 256 and adding the last 8 bits of the original IP destination address: index = (NHPL[14:0] << 8) + ip_addr[7:0].

6. One more cache (hit) or memory (miss) read gets result[15:0] = T2_RIB[index]. The next hop is given by result[15:6].
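The following is a minimal C sketch of this lookup path, reusing the NHPL helpers sketched in Section III. The struct layout, identifiers, and the flat pointer stored in the info field are our own assumptions for illustration; the paper's actual implementation maps these steps onto custom Xtensa instructions (Section V).

    #include <stdint.h>

    /* 12-byte 24/8c first-level entry (our layout, not the paper's code). */
    typedef struct {
        uint64_t bitmap;   /* t1_entry[95:32] */
        uint32_t info;     /* t1_entry[31:0]: 1-2 inline NHPLs or a pointer */
    } t1_entry_t;

    extern t1_entry_t T1_RIB[1 << 18];
    extern uint16_t   T2_RIB[];        /* 256 entries per base index */

    static int popcount64(uint64_t x) {
        int n = 0;
        while (x) { x &= x - 1; n++; }
        return n;
    }

    /* Returns the next hop for ip_addr, following steps 1-6 above. */
    uint16_t route_lookup(uint32_t ip_addr) {
        uint32_t segment = ip_addr >> 14;           /* ip_addr[31:14] */
        uint32_t offset1 = (ip_addr >> 8) & 0x3F;   /* ip_addr[13:8]  */
        t1_entry_t e = T1_RIB[segment];

        int all_ones     = popcount64(e.bitmap);
        /* 1's in result[95:K], with K = 95 - offset1 */
        int leading_ones = popcount64(e.bitmap >> (63 - offset1));

        uint16_t nhpl;
        if (all_ones == 1)
            nhpl = (uint16_t)(e.info >> 16);                  /* NHPL[1] */
        else if (all_ones == 2)
            nhpl = (leading_ones == 1) ? (uint16_t)(e.info >> 16)
                                       : (uint16_t)(e.info & 0xFFFF);
        else  /* info is the NHPL array's address (32-bit target assumed) */
            nhpl = ((const uint16_t *)(uintptr_t)e.info)[leading_ones - 1];

        if (!nhpl_is_index(nhpl))
            return nhpl_next_hop(nhpl);                       /* NHPL[14:6] */

        /* index = (NHPL[14:0] << 8) + ip_addr[7:0] */
        uint32_t index = ((uint32_t)nhpl_t2_index(nhpl) << 8) + (ip_addr & 0xFF);
        return (uint16_t)((T2_RIB[index] >> 6) & 0x3FF);      /* result[15:6] */
    }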

B. Routing Update Algorithm

Upon receiving an IP routing control packet, which contains the 3-tuple (ip_addr, prefix_length, next_hop), the routing table must be updated. We compute the new NHPL as new_nhpl[15:0] = (next_hop << 6) + prefix_length. If the new NHPL differs from what is stored in the table and the new route is more specific than the stored one, we must modify the table. Three cases need to be considered: prefix_length <= 18; 18 < prefix_length <= 24; and prefix_length > 24. Let us consider each case separately.

Case 1: prefix_length <= 18. In this case we do not need to change the bitmap: such a route matches all 64 bits of each affected bitmap, so dim(NHPL) is unchanged and only the contents of the NHPL array may need updating. The route update matches one or more entries in T1_RIB: exactly one entry for prefix_length 18, and 2^(18-8) = 1K entries for prefix_length 8. For each matched entry, we walk through the whole NHPL array and check whether each NHPL must change. If an NHPL's marker bit is 0, we obtain the old prefix length stored in the table; if it is less than or equal to the new prefix_length and the new NHPL differs from the old one, we replace the old NHPL with the new one. If the marker bit is 1, we get the index into T2_RIB and scan the corresponding 256 entries of T2_RIB; for each entry, if the old prefix length is less than or equal to the new prefix length, we replace it with the new NHPL information.
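A sketch of Case 1 in C follows, reusing the structures and helpers from the sketches above. The helpers t1_dim() and t1_nhpl(), which return a block's NHPL-array dimension and a pointer to its i-th NHPL (inline or in the extended table), are assumptions of ours, not the paper's code.

    /* Case 1 (prefix_length <= 18): for every matched T1_RIB entry,
     * overwrite each NHPL whose stored prefix length does not exceed
     * the new one. */
    void update_case1(uint32_t ip_addr, uint32_t plen, uint16_t next_hop) {
        uint16_t new_nhpl = (uint16_t)((next_hop << 6) | plen);
        uint32_t first = (ip_addr >> 14) & ~((1u << (18 - plen)) - 1u);
        uint32_t count = 1u << (18 - plen);    /* matched T1_RIB entries */

        for (uint32_t i = first; i < first + count; i++) {
            int dim = t1_dim(&T1_RIB[i]);              /* assumed helper */
            for (int j = 0; j < dim; j++) {
                uint16_t *e = t1_nhpl(&T1_RIB[i], j);  /* assumed helper */
                if (nhpl_is_index(*e)) {
                    /* scan the 256 T2_RIB entries behind this index */
                    uint16_t *t2 = &T2_RIB[(uint32_t)nhpl_t2_index(*e) << 8];
                    for (int k = 0; k < 256; k++)
                        if (nhpl_plen(t2[k]) <= plen) t2[k] = new_nhpl;
                } else if (nhpl_plen(*e) <= plen && *e != new_nhpl) {
                    *e = new_nhpl;
                }
            }
        }
    }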
Case 2: 18 < prefix_length <= 24. In this case the update matches exactly one entry in T1_RIB, and within that entry it may match one or more bits of the 64-bit bitmap: exactly one bit for prefix_length 24, and 2^(24-20) = 16 bits for prefix_length 20. For each matched bit, we walk through the bitmap from left to right and decide whether the bitmap must change, based on various conditions. We only update the bitmap and NHPL array if the old prefix length is no more than the new prefix length and the new NHPL is distinct from the old NHPL stored in the table.

If the marker bit of the old NHPL is 0, many cases must be considered. Let P, C, and F denote the NHPLs associated with the previous bit, the current bit, and the following bit whose value is 1, respectively, and let N denote the new NHPL. Consider first a current bit at the beginning of the bitmap. Its value can only be 1, since the most significant bit is always set to 1. Two sub-cases arise, depending on whether the following bit is 0 or 1. If the following bit is 0, the following bit shares the NHPL of the first bit; since we must update only the NHPL C associated with the first bit, the NHPL array changes from C F to N C F and the bitmap from 10 to 11. If the following bit is 1 and the new NHPL N differs from F, we simply change the NHPL array from C F to N F by replacing C with N; the bitmap stays 11 and the dimension of the NHPL array is unchanged. If N happens to equal F, we change the NHPL array from C F to F by deleting the current NHPL; the dimension decreases by 1 and the bitmap changes from 11 to 10. Without working through every case in this way, Table 2 lists them all: the old bitmap, the condition under which the bitmap and NHPL array change, the new bitmap, the old and new NHPL arrays, and the change in NHPL array dimension.

TABLE 2: BITMAP AND NHPL CHANGES FOR 18 < PREFIX_LENGTH <= 24

Position  Bitmap        Condition   NHPL array           dim
          old    new                old      new
begin     10     11     (none)      C F      N C F        +1
begin     11     11     N != F      C F      N F           0
begin     11     10     N == F      C F      F            -1
middle    x00x   x11x   (none)      P F      P N P F      +2
middle    x01x   x11x   N != F      P F      P N F        +1
middle    x01x   x10x   N == F      P F      P F           0
middle    x10x   x11x   N != P      P C F    P N C F      +1
middle    x10x   x01x   N == P      P C F    P C F         0
middle    x11x   x10x   N == F      P C F    P F          -1
middle    x11x   x01x   N == P      P C F    P F          -1
middle    x11x   x11x   otherwise   P C F    P N F         0
end       x0     x1     (none)      P        P N          +1
end       x1     x1     N != P      P C      P N           0
end       x1     x0     N == P      P C      P            -1

If the marker bit of the old NHPL is 1, we get the index into T2_RIB and scan the corresponding 256 entries of T2_RIB. For each entry in T2_RIB, if the old prefix length is less than or equal to the new prefix length, we replace it with the new NHPL information.
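As a concrete instance of Table 2, here is a C sketch of its three begin-position rows. The in-place array editing is a simplification of ours (capacity is assumed sufficient), and all names are ours.

    #include <stdint.h>
    #include <string.h>

    /* Apply the "begin" rows of Table 2. Bitmap bit 63 is the first
     * (always-1) bit; nhpl[] is the block's NHPL array with dimension
     * *dim; n is the new NHPL N. Callers have already checked the
     * prefix-length and distinctness guards. */
    static void update_begin(uint64_t *bitmap, uint16_t *nhpl,
                             int *dim, uint16_t n) {
        int following = (int)((*bitmap >> 62) & 1);
        if (!following) {
            /* 10 -> 11 : C F -> N C F, dim +1 */
            memmove(&nhpl[1], &nhpl[0], (size_t)*dim * sizeof nhpl[0]);
            nhpl[0] = n;
            *bitmap |= 1ULL << 62;
            (*dim)++;
        } else if (n != nhpl[1]) {
            /* 11 -> 11 : C F -> N F, dim unchanged */
            nhpl[0] = n;
        } else {
            /* 11 -> 10 : C F -> F, dim -1 */
            memmove(&nhpl[0], &nhpl[1], (size_t)(*dim - 1) * sizeof nhpl[0]);
            *bitmap &= ~(1ULL << 62);
            (*dim)--;
        }
    }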

Case 3: prefix_length > 24. In this case, the new route matches exactly one entry in T1_RIB and one bit of the 64-bit bitmap. We only update the bitmap and NHPL array if the old prefix length is no more than the new prefix length and the new NHPL is distinct from the old NHPL stored in the table. If the marker bit of the old NHPL is 0, we obtain a new distinct index in the range 0 to 32,767 (one not yet used in T1_RIB), store it in the T1_RIB entry, and set the marker bit to 1. The index points to the beginning of a 256-entry block of T2_RIB. The unmatched T2_RIB entries are filled with the old NHPL from the T1_RIB entry, while the matched T2_RIB entries receive the new NHPL. Based on the bit pattern of the current and following bits, we then update the bitmap and NHPL array; all the cases are listed in Table 3. If the marker bit of the old NHPL is 1, we get the index into T2_RIB and update the matched entries of T2_RIB with the new NHPL wherever the old prefix length there is no more than the new prefix length.

TABLE 3: BITMAP AND NHPL CHANGES FOR PREFIX_LENGTH > 24

Position  Bitmap        Condition   NHPL array           dim
          old    new                old      new
begin     10     11     (none)      C F      N C F        +1
begin     11     11     (none)      C F      N F           0
middle    x00x   x11x   (none)      P F      P N P F      +2
middle    x01x   x11x   (none)      P F      P N F        +1
middle    x10x   x11x   (none)      P C F    P N C F      +1
middle    x11x   x11x   (none)      P C F    P N F         0
end       x0     x1     (none)      P        P N          +1
end       x1     x1     (none)      P C      P N           0

V. SIMULATION & PERFORMANCE ANALYSIS

To evaluate the performance of the 24/8c data structure and the lookup/update algorithms, we implement the algorithms in C. The software can run on any processor platform that supports C. In the simulation described below, we use a processor called Xtensa, a high-performance, configurable 32-bit RISC microprocessor core [12]. Xtensa allows the processor to be configured in terms of bus width, cache size, cache line size, number of interrupts, etc. It also supports the Tensilica Instruction Extension (TIE) language (with syntax similar to Verilog), which can be used to describe extensions to the core instruction set [12]. Using TIE to add instruction extensions can be quite useful for optimizing functionality and performance in specific applications. We develop a few customized instructions to speed up route lookup:

1. t1_lookup: load the 128-bit entry from T1_RIB.

2. t1e_lookup: load the NHPL either from the T1_RIB entry or from the extended T1_RIB table, by computing the total number of 1's and the number of leading 1's.

3. t2_lookup: load the next hop from T2_RIB if the marker bit is 1.

We configure the processor with the following key parameters: 128-bit processor interface (PIF), 32 registers, 4-way set-associative caches with 16-byte line size, 16 KB cache size, 200 MHz clock frequency, etc. Since the extended T1_RIB is quite small, we can place it in on-chip memory; T2_RIB is also small and can be placed on-chip. Since T1_RIB is 3 MB, it is not practical to put the whole T1_RIB table on-chip with current technology. Thus our lookup algorithm requires at most 1 off-chip memory access (to T1_RIB) and at most 2 on-chip accesses (to the extended T1_RIB table and T2_RIB) per lookup. Note that each load instruction has a 2-cycle latency for on-chip memory, while an off-chip access requires 8 cycles (including processor stall cycles). So in the worst case t1_lookup needs 8 cycles while t1e_lookup and t2_lookup need 2 cycles each, for a total of 12 cycles per lookup. In the typical case, where cache hits occur, we need 6 cycles per lookup. For a processor at 200 MHz this translates to 16.67 MLPS in the worst case and 33 MLPS in the typical case. The performance can be further scaled up to 66 MLPS by processing multiple packets at the same time to hide the extra load cycles. We have also performed hardware synthesis for the configurable processor with the extended instructions: the configured Xtensa core requires about 65K gates, with an additional 6.5K gates for the extended TIE instructions.
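The throughput figures follow directly from the cycle counts; the 66 MLPS number corresponds to an effective 3 cycles per lookup, which is our reading of keeping two lookups in flight so that one lookup's computation hides the other's load latency:

    200 MHz / 12 cycles per lookup ~= 16.7 MLPS  (worst case, cache misses)
    200 MHz /  6 cycles per lookup ~= 33.3 MLPS  (typical case, cache hits)
    200 MHz /  3 cycles per lookup ~= 66.7 MLPS  (two lookups overlapped)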
VI. PERSPECTIVE AND SUMMARY

In this paper, we have defined a novel data structure that compresses the 24/8 routing table from 32 MB to about 3 MB. By developing a few extended instructions for a configurable processor at 200 MHz, we can achieve up to 66 MLPS of wire-speed packet forwarding without buffering. The data structure we presented serves both route lookup and update. It is a primitive structure that can support both the large forwarding tables of carrier-class/backbone routers and the small forwarding tables of edge routers. The lookup scheme supports longest prefix matching, and the update algorithm supports incremental route update. Using the same compression scheme, we have also developed another data structure that further compresses the routing table to less than 0.5 MB.

REFERENCES

[1] T. Chiueh and P. Pradhan, "High-Performance IP Routing Table Lookup Using CPU Caching," IEEE INFOCOM '99.
[2] M. Degermark et al., "Small Forwarding Tables for Fast Routing Lookups," ACM SIGCOMM '97, Palais des Festivals, Cannes, France.
[3] P. Gupta, S. Lin and N. McKeown, "Routing Lookups in Hardware at Memory Access Speeds," IEEE INFOCOM '98, San Francisco, CA, April 1998.
[4] N. Huang and S. Zhao, "A Novel IP-Routing Lookup Scheme and Hardware Architecture for Multigigabit Switching Routers," IEEE JSAC, vol. 17, no. 6, June 1999.
[5] H. Ji and R. Srinivasan, "Xtensa Processor Extensions for Fast IP Packet Forwarding," Application Note, Tensilica, Inc., Nov. 2000.
[6] A. McAuley and P. Francis, "Fast Routing Table Lookup Using CAMs," IEEE INFOCOM '93, San Francisco, CA, March 1993.
[7] Merit Network, Internet Performance Measurement and Analysis (IPMA) Project, http://www.merit.edu.
[8] Y. Rekhter and T. Li, "An Architecture for IP Address Allocation with CIDR," IETF RFC 1518, Sept. 1993.
[9] K. Sklower, "A Tree-Based Packet Routing Table for Berkeley UNIX," Proc. of the Winter 1991 USENIX Conf., Dallas, TX, Jan. 1991.
[10] V. Srinivasan and G. Varghese, "Faster IP Lookups using Controlled Prefix Expansion," ACM SIGMETRICS, 1998.
[11] W. Szpankowski, "Patricia tries again revisited," Journal of the ACM, vol. 37, no. 4.
[12] Tensilica, Inc., http://www.tensilica.com.