Journal of Network and Computer Applications


Memory-efficient IP lookup using trie merging for scalable virtual routers

Kun Huang (a), Gaogang Xie (a), Yanbiao Li (b), Dafang Zhang (b)

(a) Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
(b) Hunan University, Changsha, China

Article history: Received 3 January 3; received in revised form August 3; accepted 4 February 4; available online March 4.

Keywords: Virtual router; IP lookup; Longest prefix matching; Forwarding table; Trie

Abstract: Virtual routers are emerging as a promising way for network virtualization to run multiple virtual router instances in parallel on a common physical platform. The key scalability challenge for IP lookup in virtual routers is to support a large number of forwarding tables that fit in limited amounts of high-speed memory for good performance. This paper presents a novel trie-merged approach to memory-efficient IP lookup for scalable virtual routers. The approach exploits node isomorphism to transform a forest of multiple separate tries into an equivalent directed acyclic graph (DAG). We also propose an IP lookup architecture to speed up performance: an on-chip DAG is traversed to find the longest matching prefix, which is then used as a key to retrieve the corresponding next hop from off-chip hash tables. Experiments on realistic and synthetic IP forwarding tables show that the trie merging scheme reduces the number of nodes by up to 9. times and the memory consumption by up to . times compared to previous schemes.

1. Introduction

Network virtualization (Chowdhury and Boutaba, ) is recognized as a diversifying technique to support the coexistence of multiple virtual networks on the same physical substrate.
Virtual routers (Anwer and Feamster, ; Anwer et al., ; Sherwood et al., ; Lu et al., ) are emerging as a promising technology for network virtualization to run multiple virtual router instances in parallel on a common physical router platform. The main purpose of virtual routers is to enable fast evaluation and deployment of new protocols and algorithms in production networks. Recently, virtual routers have found wide application in innovative network services, such as router consolidation, customer-specific routing, policy-based routing, and multi-topology routing. However, virtual routers face a scalability challenge. With the ever-growing need for virtual networks, a physical router is expected to support tens or hundreds of virtual router instances, each with its own forwarding table. For example, a Juniper logical router can be configured to support a large number of virtual router instances. These virtual router instances require a significant amount of memory to store all the forwarding tables. Unfortunately, due to limited amounts of high-speed memory, scaling to these numbers makes the memory usage prohibitive. On the other hand, to keep up with Gbps-scale transmission rates (Song et al., 9; Bando and Chao, ), virtual routers require memory-efficient forwarding data structures for line-rate IP lookup. In addition, due to aggregated updates from all the forwarding tables, virtual routers should support fast incremental updates. Hence, achieving high memory scalability for IP lookup is vital to the success of virtual routers. (Corresponding author: Kun Huang. E-mail addresses: huangkun9@ict.ac.cn (K. Huang), xie@ict.ac.cn (G. Xie), lybmath@hnu.edu.cn (Y. Li), dfzhang@hnu.edu.cn (D. Zhang).) IP lookup in a single router is a well-studied problem. A router stores a forwarding table, which comprises a set of prefixes and their associated next hops.
IP lookup determines the longest prefix in the forwarding table that matches the destination IP address of an incoming packet. IP lookup solutions fall into three main categories: ternary content addressable memory (TCAM)-based, hash-based, and trie-based solutions. TCAM-based solutions (Zane et al., 3; Zheng et al., 6; Lu and Sahni, ) provide deterministic, high-speed IP lookup in one clock cycle, but suffer from excessive power consumption, high cost, and low density. Hash-based solutions (Song et al., 9; Bando and Chao, ; Waldvogel et al., 99; Broder and Mitzenmacher, ; Dharmapurikar et al., 3; Hasan et al., 6; Yu et al., 9) have been proposed to accelerate lookup performance, but they require prohibitive amounts of high-bandwidth memory, which impedes their use in practice. Trie-based solutions (Srinivasan and Varghese, 99; Degermark et al., 99; Eatherton et al., 4; Song

et al., 9; Huang et al., ) use tries to perform longest prefix matching (LPM), where the destination address of a packet is matched against a set of IP prefixes to find the longest matching prefix. Pipelining (Hasan and Vijaykumar, ; Baboescu et al., ; Jiang and Prasanna, ) is used to produce one lookup result per clock cycle, improving the throughput of trie-based solutions. There are two approaches to IP lookup in virtual routers: the separated approach and the merged approach. The separated approach uses a separate data structure to represent each forwarding table, and partitions resources among all the virtual router instances (Fu and Rexford, ). This approach has the benefit of isolating the memory usage of virtual routers. However, it incurs large memory consumption because all the forwarding data structures are stored separately, and it suffers from unbalanced memory allocation due to differing forwarding table sizes. In contrast, the merged approach consolidates multiple forwarding data structures into a shared data structure to serve packets from different virtual router instances. This approach requires much less memory than the separated approach, at the cost of relaxing the isolation requirement. Recently, several trie-based IP lookup algorithms (Fu and Rexford, ; Song et al., ; Le et al., ) have been proposed to construct a merged forwarding data structure. These algorithms use a natural but naive approach to achieve significant memory reductions. Nevertheless, they disregard the node isomorphism among multiple tries, so there are opportunities to construct a more succinct merged forwarding data structure that scales well to a large number of forwarding tables. Hence, scalable virtual routers require a memory-efficient merged data structure for IP lookup with support for quick updates.
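To make the contrast concrete, here is a minimal sketch (ours, not the paper's; all names are illustrative) of the merged approach: prefixes from several tables share one binary trie, and each prefix node keeps a per-table bitmap so lookups stay correct for each virtual router instance.

```python
# Sketch of a merged (shared) forwarding structure: one binary trie holds the
# prefixes of every table; a per-node bitmap records which tables own the prefix.

class OverlayNode:
    def __init__(self):
        self.children = [None, None]  # child for bit 0 and bit 1
        self.tables = 0               # bitmap: bit t set => prefix exists in table t
        self.next_hops = {}           # table id -> next hop

def overlay_insert(root, table_id, prefix_bits, next_hop):
    node = root
    for bit in prefix_bits:
        i = int(bit)
        if node.children[i] is None:
            node.children[i] = OverlayNode()
        node = node.children[i]
    node.tables |= 1 << table_id
    node.next_hops[table_id] = next_hop

def overlay_lookup(root, table_id, addr_bits):
    """Longest-prefix match restricted to prefixes owned by table_id."""
    node, best = root, None
    for bit in addr_bits:
        node = node.children[int(bit)]
        if node is None:
            break
        if node.tables & (1 << table_id):
            best = node.next_hops[table_id]
    return best

root = OverlayNode()
overlay_insert(root, 0, "0", "P1")     # table 0: prefix 0*
overlay_insert(root, 1, "0", "Q1")     # table 1 shares the same trie node
overlay_insert(root, 1, "01", "Q2")
print(overlay_lookup(root, 0, "011"))  # P1 (table 0 does not own 01*)
print(overlay_lookup(root, 1, "011"))  # Q2
```

Tables whose tries have similar shapes pay for the shared nodes only once, which is exactly the memory saving the merged approach trades against strict isolation.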
In this paper, we propose a novel trie-merged approach to memory-efficient IP lookup for scalable virtual routers. This approach constructs a merged forwarding data structure that requires less memory while supporting incremental updates. Our motivating observation is that node isomorphism exists between tries, so that multiple isomorphic nodes can be replaced with a single equivalent node. We first use a virtual root node to transform a forest of multiple separate tries into a single merged tree, and then exploit node isomorphism to equivalently transform the tree into a succinct directed acyclic graph (DAG). The DAG has the same lookup behavior as the original individual tries but considerably reduces the memory usage of the forwarding data structures. In addition, we propose an IP lookup architecture to accelerate performance, designed around fast/slow path separation. When performing IP lookup in this architecture, an on-chip DAG is traversed to find the longest matching prefix, and the prefix is then used as a key to retrieve the corresponding next hop from off-chip hash tables. Experiments on realistic and synthetic IP forwarding tables show that the trie merging scheme achieves significant memory reductions. Compared to previous schemes, our scheme reduces the number of nodes by up to 9. times and the memory consumption by up to . times. This paper makes the following main contributions:

- We propose a trie-merged approach to transform a forest of multiple separate tries into an equivalent DAG that requires less memory for IP lookup.
- We propose an IP lookup architecture to speed up performance, where an on-chip DAG is used to find the longest matching prefix, and off-chip hash tables are searched to retrieve the next hop.

The rest of this paper is organized as follows. Section 2 introduces the related work.
In Section 3, we describe the trie merging scheme, which constructs a memory-efficient merged data structure for IP lookup. Section 4 reports experimental results on realistic and synthetic IP forwarding tables. Finally, Section 5 concludes this paper.

2. Related work

Virtual routers have recently gained much research interest in the networking community. Virtual routers allow multiple virtual router instances to run in parallel on a single physical hardware platform, each performing IP lookup on its own forwarding table. Virtual routers are regarded as a powerful technique for evaluating and deploying new routing protocols and forwarding algorithms in production networks. Before they can be applied successfully, it is essential for virtual routers to achieve scalable IP lookup. Many trie-based IP lookup algorithms for a single router have been proposed over the years. These algorithms use a trie data structure to represent a set of prefixes for performing LPM. A trie is a tree-based data structure that uses the bits of a prefix to direct the branching. A packet's destination address is used as a key to search the trie for the longest matching prefix, which determines the next hop. Several practical compact encoding schemes, such as the Leaf-Pushed Trie (Srinivasan and Varghese, 99), the Lulea Trie (Degermark et al., 99), the Tree Bitmap Trie (Eatherton et al., 4), and Shape Graphs (Song et al., 9), have been proposed to achieve fast and memory-efficient IP lookup in a large-scale forwarding table. Unlike prior work that focuses on transforming a single trie into a succinct data structure, our work merges multiple separate tries into a shared data structure, achieving significant memory reductions for a large number of forwarding tables in virtual routers. Recently, several merged schemes for trie-based IP lookup in virtual routers have been proposed to reduce memory usage.
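The basic trie lookup just described can be sketched as follows (a minimal 1-bit trie; prefixes are written as bit strings, and names such as TrieNode are ours, not from any cited scheme):

```python
# A minimal binary trie with longest-prefix matching: each prefix is a path
# from the root, and lookup remembers the last prefix node seen on the path.

class TrieNode:
    def __init__(self):
        self.children = [None, None]  # branch on bit 0 / bit 1
        self.next_hop = None          # set only at prefix nodes

def trie_insert(root, prefix_bits, next_hop):
    node = root
    for bit in prefix_bits:
        i = int(bit)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.next_hop = next_hop

def longest_prefix_match(root, addr_bits):
    node, best = root, None
    for bit in addr_bits:
        node = node.children[int(bit)]
        if node is None:
            break                      # ran off the trie: keep the best so far
        if node.next_hop is not None:
            best = node.next_hop       # a longer matching prefix was found
    return best

root = TrieNode()
trie_insert(root, "0", "P1")
trie_insert(root, "01", "P2")
trie_insert(root, "011", "P3")
print(longest_prefix_match(root, "0110"))  # P3, the longest of the three matches
```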
These schemes address the following problem: given a number of separate tries, how can they be merged into a shared data structure that not only minimizes the overall memory usage but also allows correct lookups into each of the separate tries? Figure 1 illustrates a simple example of two forwarding tables and their binary tries. Table 1 and Table 2 each contain a handful of entries, each with a prefix and an associated next hop. In Trie 1 and Trie 2, each shaded node is a prefix node that represents a prefix. During binary trie traversal, if the current bit of the IP address is zero, the left child is selected; otherwise, the right child is selected. We use this example to clarify how each merged scheme works throughout the paper. Trie overlay (Fu and Rexford, ) achieves memory-efficient IP lookup in a merged manner. This scheme combines all the prefixes of multiple forwarding tables into a single trie using a simple overlaying mechanism. To perform correct lookups, each prefix node in the merged trie maintains a bitmap indicating which forwarding tables the prefix belongs to. Figure 2(a) illustrates the trie overlay scheme: nodes of Trie 1 and Trie 2 (see Fig. 1) are overlapped starting from the root nodes. This scheme works well when the separate forwarding tables have similar structures. Otherwise, simple overlaying cannot yield any memory reduction, and may even significantly increase memory usage. Trie braiding (Song et al., ) is an alternative approach. This scheme increases the overlap among multiple forwarding tables by using a child rotation mechanism. Each node may swap its left and right child nodes, and a braiding bit identifies the direction of the trie traversal for correct lookups. Figure 2(b) illustrates the trie braiding scheme.

Fig. 1. Two forwarding tables and their binary tries.

Fig. 2. Illustration of previous merged schemes: (a) trie overlay, and (b) trie braiding.

We can see that two braiding bits at each node are used to reverse the meaning of its child pointers: when a braiding bit is set to 1, the left pointer represents the '1' branch and the right pointer represents the '0' branch. Heuristic algorithms are proposed to minimize the number of nodes in the merged trie. However, a reduction in the total number of nodes does not necessarily lead to a reduction in overall memory usage, because no memory is allocated to leaf nodes, and non-leaf nodes dominate the memory consumption of the trie data structure. Our experiments (see Section 4) also demonstrate this drawback of the trie braiding scheme. Set partition (Le et al., ) is proposed to achieve memory efficiency and high throughput. This scheme uses a set-bounded leaf-pushing mechanism to partition a merged forwarding table into two prefix subsets, and then uses a tree data structure to represent each subset so that IP lookup is performed in parallel in all the pipelines. However, this scheme requires more memory because it stores duplicated prefix nodes. Additionally, one search is performed simultaneously in two trees to find a possible match, which increases processing time and ultimately limits the lookup throughput. In addition, TCAM-based scalable virtual routers (Luo et al., ) have been proposed to reduce the memory requirements of TCAMs. This scheme proposes completion and splitting techniques to merge multiple forwarding tables into small TCAMs, achieving good scalability.
These techniques are orthogonal to ours, since we focus on SRAM-based scalable virtual routers in this paper.

3. Trie merging

In this section, we describe the trie merging scheme. The idea of node isomorphism is exploited to transform multiple separate tries into a succinct merged data structure that not only enables memory-efficient IP lookup but also supports fast incremental updates.

3.1. Binary trie merging

We assume that there are m binary tries Trie_i (i = 1, ..., m) representing m individual forwarding tables named Table_i in virtual routers, each containing N_i IP prefixes and associated next hops. A binary trie is a natural tree-based data structure for IP lookup. The trie stores a set of IP prefixes, where each prefix corresponds to a path from the root node to a prefix node that represents it. We now present the binary trie merging scheme as follows. First, we use a virtual root node to combine a forest of multiple separate tries into a single merged tree, and then construct a merged forwarding table that contains all the prefixes and next hops. Each entry in the merged table consists of a merged prefix and an associated next hop. The merged prefix is the concatenation of a forwarding table identifier Table_i and the original prefix. Thereafter, we use the leaf pushing scheme (Srinivasan and Varghese, 99) to expand the merged tree, so that prefixes at internal nodes are pushed down to leaf nodes.
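The two steps above (concatenating a table identifier with each prefix, then leaf pushing) can be sketched as follows; the helper names and the tiny example tables are ours, not the paper's:

```python
# Build a merged tree by inserting "table-id bits + prefix bits", then push
# prefixes down so that only leaves carry next hops (leaf pushing, sketched).

class Node:
    def __init__(self):
        self.children = [None, None]
        self.next_hop = None

def insert(root, bits, next_hop):
    node = root
    for b in bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = Node()
        node = node.children[i]
    node.next_hop = next_hop

def leaf_push(node, inherited=None):
    """Push each internal node's next hop down to the leaves below it."""
    if node.next_hop is not None:
        inherited = node.next_hop
    if node.children == [None, None]:
        node.next_hop = inherited      # leaf keeps the closest ancestor prefix
        return
    node.next_hop = None               # internal nodes carry no next hop
    for i in (0, 1):
        if node.children[i] is None:   # complete the missing branch
            node.children[i] = Node()
        leaf_push(node.children[i], inherited)

def leaf_lookup(root, bits):
    """After leaf pushing, just follow the bits until a leaf."""
    node = root
    for b in bits:
        if node.children[int(b)] is None:
            break
        node = node.children[int(b)]
    return node.next_hop

tables = {"0": [("0", "P1"), ("01", "P2")],  # one-bit table identifiers,
          "1": [("1", "Q1")]}                # as in the merged-prefix scheme
root = Node()
for tid, entries in tables.items():
    for prefix, nh in entries:
        insert(root, tid + prefix, nh)       # merged prefix = id + prefix
leaf_push(root)
print(leaf_lookup(root, "000"))  # P1: pushed down from the internal prefix 0*
```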

Fig. 3. Transforming two separate tries to a single merged tree.

Figure 3 illustrates an example of transforming two binary tries into a single merged tree and constructing a merged table. We can see that the virtual root node combines the two tries Trie 1 and Trie 2 (see Fig. 1) into a merged tree. Note that Trie 1 and Trie 2 are leaf-pushed binary tries, where each leaf node is a prefix node representing an extended prefix. Each entry in the merged table is uniquely identified by its merged prefix, which comprises a forwarding table identifier and an original prefix. In Fig. 3 we set the one-bit table identifiers of Table 1 and Table 2 to 0 and 1, respectively. Generally, when there are m forwarding tables, the maximum length of a merged prefix is L + log m, where L is the prefix length (e.g. 32 for IPv4 and 128 for IPv6) and log m is the bit size of the forwarding table identifier. Second, we use the idea of node isomorphism to transform the merged tree into an equivalent binary directed acyclic graph (DAG). This rests on the fact that a merged tree contains many isomorphic, equivalent nodes. Two nodes in a tree are isomorphic and equivalent if and only if both their left child nodes and their right child nodes have the same level identifiers. We assign a level identifier to each node in a bottom-up manner. Since leaf nodes have no children, all leaf nodes are isomorphic and equivalent. Starting from the leaf nodes, we assign level identifier 1 to every leaf node, and then continue to their parent nodes until the virtual root node is reached. A group of isomorphic equivalent nodes in a tree have the same level identifier since their child nodes have the same level identifiers.
Thereafter, we construct one new vertex in the DAG for each group of isomorphic equivalent nodes in the merged tree, in order to reduce memory usage. Note that the level identifier of a group of isomorphic equivalent nodes is used as the identifier of the unique vertex added to the DAG. Figure 4 illustrates an example of transforming a merged tree to a DAG. We first identify and label all isomorphic equivalent nodes of the merged tree in a bottom-up manner, from the leaf nodes to the root node. The isomorphic equivalent nodes in the tree are labeled NID-LID, where NID is a node identifier and LID is a level identifier. Due to leaf pushing, all leaf nodes (e.g. nodes D to I and nodes d to m) have no children and are therefore isomorphic and equivalent, so they are grouped at level 1. Thus, a shaded vertex in the DAG, called the prefix vertex, is constructed to substitute for the leaf nodes with level identifier LID = 1. Similarly, we continue to identify and label the other isomorphic equivalent nodes of the merged tree bottom-up, and construct corresponding DAG vertices to substitute for each group of isomorphic equivalent nodes at the same level; DAG vertices thus replace the groups of isomorphic equivalent nodes at each higher level of the tree. Note that we assign level identifiers in a left-to-right manner when conflicts of level identification occur: for instance, level identifier 3 is assigned to left node C, and level identifier 4 to right node b, as shown in Fig. 4. In addition, each directed edge in the DAG represents the corresponding child pointer in the merged tree. From Fig. 4, we observe that the root vertex of the DAG has two directed edges pointing to the vertices corresponding to the root nodes of Trie 1 and Trie 2. Figure 4 also shows the vertex data structures of the binary DAG.
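The bottom-up labeling can be sketched with standard hash-consing: a node's identifier is determined solely by its children's identifiers, so structurally identical subtrees collapse onto one DAG vertex. The tuple encoding below is our assumption, not the paper's exact layout; after leaf pushing all leaves are interchangeable (next hops live off-chip), so a single signature covers every leaf.

```python
# Tree nodes are (left, right) tuples; a leaf is the string "leaf".
# `table` maps a signature (left-id, right-id) to its unique DAG vertex id.

def vertex_id(node, table):
    """Bottom-up hash-consing: equal signatures share one DAG vertex."""
    if node == "leaf":
        sig = "leaf"                       # every leaf is isomorphic
    else:
        left, right = node
        sig = (vertex_id(left, table), vertex_id(right, table))
    if sig not in table:
        table[sig] = len(table)            # allocate the next vertex id
    return table[sig]

# Two structurally identical subtrees hung under a virtual root:
t1 = ("leaf", ("leaf", "leaf"))
t2 = ("leaf", ("leaf", "leaf"))
merged = (t1, t2)
table = {}
root_id = vertex_id(merged, table)
print(len(table))  # 4 DAG vertices stand in for the 11 tree nodes
```

Here t1 and t2 receive the same identifier, so the merged tree's 11 nodes shrink to 4 vertices, which is exactly the saving the level-identifier pass obtains.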
Each non-root vertex in the DAG has three fields: the identifiers of its left and right child vertices, and a next-hop bitmap named NH-BMP. NH-BMP indicates whether the left or right child vertex is a prefix vertex: if so, the corresponding bit in NH-BMP is set to 1; otherwise, it is set to 0. Each child vertex identifier has a size of at least log n bits, and the next-hop bitmap has a size of 2 bits in the binary DAG, where n is the total number of vertices in the DAG. Thus, each vertex requires at least 2 log n + 2 bits, and the overall DAG of n vertices requires n(2 log n + 2) bits of memory. Therefore, the DAG requires less memory than multiple separate tries, since it has fewer vertices, each of smaller size.

3.2. Lookup in the binary DAG

To keep up with line rates, we leverage the idea of fast/slow path separation to accelerate lookup performance. We propose an IP lookup architecture where the DAG is stored in fast on-chip SRAM, and hash tables are stored in slow off-chip DRAM. Figure 5 illustrates this IP lookup architecture using trie merging. Each vertex in the on-chip DAG uses the next-hop bitmap NH-BMP in place of next-hop pointers, which helps reduce on-chip memory usage. Each off-chip hash table contains all the prefixes and next hops of one separate forwarding table. Using this architecture, the IP lookup process works as follows. When searching the IP address of an incoming packet with a forwarding table identifier, we first traverse the on-chip DAG to find the longest matching prefix of the IP address, and then use the forwarding table identifier as an index to select one off-chip hash

Fig. 4. Transforming a merged tree to a DAG, and the vertex data structures of the DAG.

table. Finally, the prefix is used as a hash key to retrieve the corresponding next hop from the off-chip hash table. We show an example of IP lookup in the binary DAG as seen in Fig. 5. Suppose that we need to search the IP address of an incoming packet carrying the identifier of Table 1. Since the identifier of Table 1 is 0, the search starts at the root vertex and continues to vertex 6. For the first bit of the IP address, vertex 6 checks that the right bit in its NH-BMP is 0 (see Fig. 4), and continues to its right child, vertex 3. For the second bit, vertex 3 checks that the left bit in its NH-BMP is 0, and continues to its left child vertex. For the third bit, that vertex finds that the right bit in its NH-BMP is 1, indicating that its right child is a prefix vertex, so the search terminates, producing the longest matching prefix. We use the identifier of Table 1 to select the first off-chip hash table, and compute a hash of the matched prefix to retrieve the corresponding next hop P4 from that table. Finally, the packet is forwarded to P4.

Fig. 5. IP lookup architecture using trie merging.

3.3. Multi-bit trie merging

In practice, a multi-bit trie with a stride of s is widely used to boost the lookup throughput of a binary trie. A multi-bit trie inspects s bits at a time, so each node has 2^s child nodes. A multi-bit trie can reduce the total number of nodes since it consumes s bits at a time.
However, the node size in the multi-bit trie grows exponentially with the stride, which leads to a rapid increase in memory usage. When the stride is too large, the increase in node size can outpace the reduction in the number of nodes. In recent years, several optimization techniques (Srinivasan and Varghese, 99; Degermark et al., 99) have been proposed to minimize the memory usage of the multi-bit trie while improving lookup throughput. We propose a multi-bit trie merging scheme analogous to the binary trie merging scheme above. By exploiting node isomorphism, this scheme transforms a forest of multiple multi-bit tries into a merged tree, and then into a succinct multi-bit DAG. In practice, a multi-bit DAG can be derived directly from a binary DAG. The construction procedure works as follows. Assume a binary DAG and a stride of size s. Starting from the starting vertex of each trie, we recursively walk the binary DAG using each of the 2^s possible s-bit sub-strings. Each unique vertex reached is added to the multi-bit DAG and connected to other vertices with 2^s s-bit directed edges. This step is repeated for each unique vertex added to the multi-bit DAG, and the process terminates when no new unique vertex is reached in the binary DAG. Figure 6 illustrates an example of a 2-stride DAG derived from the binary DAG in Fig. 4. The root vertex of the binary DAG corresponds to root vertex 1' of the 2-stride DAG, and the starting vertices of the two tries correspond to starting vertices 2' and 3'. We start from the two starting vertices and separately walk through the binary DAG, reading the four possible 2-bit sub-strings 00, 01, 10, and 11 from each; each newly reached vertex is added to the 2-stride DAG and connected by four directed edges labeled with those sub-strings.
Vertex reads sub-strings and to reach vertex , reads sub-string to reach vertex , and reads sub-string to reach vertex 3. Since vertex 3 is unique, vertex 6' is added to the 2-stride DAG. Finally, we obtain a 2-stride DAG with six vertices. However, the multi-bit trie merging scheme faces another challenge, prefix expansion, which causes memory inefficiency. The reason lies in the fact that a multi-bit DAG is essentially transformed from multiple separate multi-bit leaf-pushed tries. The number of prefixes contained in the multi-bit DAG may expand by about .6 times on average after leaf pushing (Srinivasan and Varghese, 99). For instance, as shown in Fig. 6, vertex ' in the multi-bit DAG can match four expanded prefixes n, n, n, and n, while the corresponding vertex in the binary DAG matches only two original prefixes n and n. Hence, the redundant expanded prefixes require more off-chip memory, since all the prefix pairs {prefix, next hop} are stored in off-chip hash tables.
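The prefix expansion caused by leaf pushing can be made concrete with a small sketch of controlled prefix expansion: a prefix whose length is not a multiple of the stride s is replaced by several longer prefixes covering the same address range. The function name and example prefixes are illustrative.

```python
# A sketch of controlled prefix expansion: a prefix is expanded to the
# next multiple of the stride s, which is exactly the redundancy the
# mask code bitmap (next section) is designed to undo.

def expand(prefix, s):
    """Expand a bit-string prefix to the next multiple of s bits."""
    target = -(-len(prefix) // s) * s          # ceil(len/s) * s
    pad = target - len(prefix)
    if pad == 0:
        return [prefix]
    return [prefix + format(i, f'0{pad}b') for i in range(2 ** pad)]

print(expand('101', 2))   # a 3-bit prefix becomes two 4-bit prefixes
```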

[Fig. 6. Deriving a 2-stride DAG from a binary DAG.]

[Fig. . Mask code bitmaps and vertex data structures of the 2-stride DAG.]

We propose a simple technique called the mask code bitmap to eliminate prefix expansion. Given an s-stride DAG, each vertex maintains a mask code bitmap of 2^s bits, each bit indicating which group of its outgoing edges belongs to the same original prefix. Consecutive equal bits in a mask code bitmap indicate that the corresponding outgoing edges belong to the same original prefix. The purpose of this bitmap is to infer the actual length of the expanded prefixes, which helps reduce the number of prefixes stored in off-chip memory. Figure shows an example of mask code bitmaps in a 2-stride DAG. Each vertex except the root vertex ' contains a mask code bitmap of four bits, since it has four outgoing edges. For instance, the mask code bitmap of vertex ' is , indicating that the first two edges belong to one original prefix and the last two edges belong to another original prefix. Vertex ' has a mask code bitmap of , indicating that the first two edges belong to different original prefixes, the third edge is not a matching prefix, and the fourth edge belongs to an original prefix. The figure also shows the vertex data structures of the 2-stride DAG. Each vertex has a six-tuple of fields: the identifiers of its four child vertices, a mask code bitmap named MC-BMP, and a next-hop bitmap named NH-BMP. MC-BMP and NH-BMP each have a size of four bits due to the stride of 2 bits. The IP lookup algorithm in a multi-bit DAG is shown in Algorithm . Using this algorithm, we show an example of IP lookup in the 2-stride DAG as seen in Fig. .
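The length inference performed with the mask code bitmap can be sketched as follows. This is an illustrative sketch: the run of consecutive equal bits around the matched child index j marks edges expanded from one original prefix, and with run length r the original length is the expanded length minus log2(r). The bitmap values and function names are made up.

```python
# A sketch of inferring the original prefix length from a mask code
# bitmap (MC-BMP): count the run of bits equal to MC-BMP[j] around the
# matched child index j, then subtract log2(run) from the expanded length.

from math import log2

def run_length(mc_bmp, j):
    """Count consecutive bits equal to mc_bmp[j] around position j."""
    lo = j
    while lo > 0 and mc_bmp[lo - 1] == mc_bmp[j]:
        lo -= 1
    hi = j
    while hi < len(mc_bmp) - 1 and mc_bmp[hi + 1] == mc_bmp[j]:
        hi += 1
    return hi - lo + 1

def original_length(expanded_len, mc_bmp, j):
    """Undo leaf-pushing: original length = expanded length - log2(run)."""
    return expanded_len - int(log2(run_length(mc_bmp, j)))

# Mirroring the paper's example: a run of length 2 at the matched index
# turns a 4-bit expanded prefix back into a 3-bit original prefix.
print(original_length(4, [1, 1, 0, 0], 0))   # -> 3
```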
Suppose that we need to search an IP address of an incoming packet with a forwarding table identifier Table . The search starts with root vertex ', and then continues to starting vertex ' of the trie. For the first two bits of the IP address, vertex ' checks its NH-BMP, and then continues to its child vertex '. For the next two bits, vertex ' checks its NH-BMP to find that its child vertex 4' is a prefix vertex, and then checks its MC-BMP to calculate the number of consecutive same-value bits around the matched location as r = 2. Next, we calculate the actual length of the matching prefix as (l − log2 r) = (4 − log2 2) = 3 bits, where l is the length of the expanded matching prefix and r is the number of consecutive same-value bits in the MC-BMP. Finally, we produce the longest matching prefix n, and then hash to search the first off-chip hash table, returning the associated next hop P4.

Algorithm . IP lookup in multi-bit DAG.

SearchMultiBitDAG(ip_addr, vid)
  /* dag is an instance of the multi-bit DAG;
     stride_size is the stride size */
  root_node = dag.getroot();
  current_node = root_node.getsubroot(vid);
  for (i = 1; i <= len(ip_addr) / stride_size; i++) do
    /* get the next sub-string of the IP address */
    sub_str = ip_addr.getsubstr(i, stride_size);
    j = calculatechildindex(sub_str);
    /* it reaches the matching node of the DAG */
    if (current_node.nexthopbmp[j] == 1) then
      /* popcount the consecutive bits in the mask code bitmap */
      mask_count = current_node.popcountmaskcode(j);
      /* calculate the valid length of the sub-string prefix */
      valid_len = stride_size - log2(mask_count);
      match_len = (i - 1) * stride_size + valid_len;
      return match_len;
    else
      /* it continues to search a child node */

      child_node = current_node.getchildnode(j);
      current_node = child_node;
    end if
  end for

3.4. Constructing efficient hash tables

In our IP lookup architecture, off-chip hash tables are used to retrieve the next hop corresponding to the longest matching prefix. The performance of the off-chip hash tables has a significant impact on the IP lookup throughput as well as on the off-chip memory usage. There have been several research studies (Broder and Mitzenmacher, ; Song et al., ; Kumar et al., ) on implementing efficient hash tables with compact storage and low collision probability. In this paper, we use a simple and efficient mechanism to construct off-chip hash tables, proposed in (Song et al., 9; Dharmapurikar et al., 3). Each hash table is composed of an array of hash sub-tables organized by prefix length. We partition all the leaf-pushed prefixes of a forwarding table into an array of subsets according to prefix length. Each subset of prefix pairs {prefix, next hop} is stored in a hash sub-table. Each hash sub-table uses a single hash function to perform the lookup, which takes just one access to off-chip memory. Figure illustrates an example of off-chip hash tables for two forwarding tables. We can see that there are two hash sub-tables for Table and three hash sub-tables for Table . Suppose that we use an on-chip DAG to produce the longest matching prefix n with a forwarding table identifier Table . We use Table to select the first array of off-chip hash sub-tables, and then use the prefix length of n, which is 3, as an index to select the second hash sub-table for the search. Finally, we hash by computing Hash() to search the hash sub-table, returning the corresponding next hop P4 for packet forwarding. However, hash collisions and bucket overflow may cause serious performance degradation of a hash table.
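The per-table, per-length organization above can be sketched briefly. This is an illustrative sketch: the prefixes, next hops, and table identifiers are made up, but the structure follows the text — select the table by identifier, the sub-table by prefix length, then resolve the prefix with a single probe.

```python
# A sketch of length-partitioned hash sub-tables: one sub-table per
# (leaf-pushed) prefix length, so each lookup needs a single probe.

def build_subtables(prefix_pairs):
    """Partition {prefix: next_hop} into sub-tables keyed by prefix length."""
    sub = {}
    for prefix, nh in prefix_pairs.items():
        sub.setdefault(len(prefix), {})[prefix] = nh
    return sub

def lookup(subtables_per_table, table_id, prefix):
    """Select table by identifier, sub-table by length, then probe once."""
    return subtables_per_table[table_id][len(prefix)].get(prefix)

# Hypothetical forwarding tables T1 and T2.
tables = {
    'T1': build_subtables({'101': 'P4', '0': 'P1', '111': 'P2'}),
    'T2': build_subtables({'10': 'Q1', '0110': 'Q3'}),
}
print(lookup(tables, 'T1', '101'))   # -> P4
```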
We can use a multiple-choice hashing scheme to optimize the performance of the off-chip hash tables, similar to previous schemes (Broder and Mitzenmacher, ; Song et al., ; Kumar et al., ). This scheme uses k (k ≥ 2) hash functions, and maps each prefix into k candidate buckets, of which only the one with the lowest load is selected to store the prefix pair {prefix, next hop}. One prefix lookup needs to access k buckets using the same k hash functions, and all the prefix pairs stored in these buckets must be searched to find the match. To simplify the design, we use a 2-choice hashing scheme for the performance evaluation, such that two memory accesses are taken to perform one lookup. To accelerate the off-chip hash table lookup, we may also use multi-port memories or multiple parallel memory modules to improve the worst-case performance of the off-chip hash tables.

3.5. Incremental updates of DAG

Network topology changes or transient router failures may lead to frequent updates of forwarding tables. The control plane in a router computes new prefixes and next hops to update the forwarding data structure in the data plane. As updates are aggregated from all the forwarding tables, it is essential for scalable virtual routers to support fast incremental updates, guaranteeing non-stop IP lookup with correct forwarding. In this paper, we use backup leaf-pushed tries to update the off-chip hash tables and the on-chip DAG. Each backup trie holds the complete set of prefixes of one forwarding table. An update can be any of three changes to prefixes and next hops: (1) the change of a next hop; (2) the deletion of an existing prefix; (3) the insertion of a new prefix and next hop. For a next-hop change, we just modify the off-chip hash table entry corresponding to the prefix. For a prefix deletion, we use the next hop of its sibling or the default next hop in the backup trie to update the next hop associated with the prefix in the off-chip hash table.
These two kinds of updates simplify the operation and save memory accesses, without modifying the on-chip DAG or interrupting IP lookup.

[Fig. . Off-chip hash sub-tables for storing the prefix pairs.]
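The multiple-choice hashing described in Section 3.4 can be sketched as follows. This is a hedged sketch: the salted SHA-256 bucket function and all keys/values are illustrative stand-ins for the paper's hash functions and prefix pairs.

```python
# A sketch of k-choice hashing for the off-chip tables: each prefix
# hashes to k candidate buckets and is stored in the least-loaded one;
# a lookup probes the same k buckets and compares keys.

import hashlib

def bucket(key, salt, nbuckets):
    """Map a key to a bucket index; the salt distinguishes hash functions."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
    return int(digest, 16) % nbuckets

def insert(buckets, key, value, k=2):
    """Place the pair in the least-loaded of k candidate buckets."""
    candidates = [bucket(key, i, len(buckets)) for i in range(k)]
    target = min(candidates, key=lambda b: len(buckets[b]))
    buckets[target].append((key, value))

def find(buckets, key, k=2):
    """Probe all k candidate buckets and scan their entries."""
    for i in range(k):
        for stored_key, value in buckets[bucket(key, i, len(buckets))]:
            if stored_key == key:
                return value
    return None

buckets = [[] for _ in range(8)]
insert(buckets, '101', 'P4')
print(find(buckets, '101'))   # -> P4
```

With k = 2, a lookup costs two bucket accesses, matching the 2-choice configuration used in the evaluation.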

[Fig. 9. Incremental updates of DAG.]

Prefix insertion is more complicated. We propose a reverse-path scheme to achieve fast prefix insertion into the DAG. Assume that a prefix with a next hop is inserted into a forwarding table. First, the prefix is expanded using leaf pushing in the backup trie. Second, we re-label the level identifier of each node along the reverse paths from the new leaf nodes to the root node in the backup trie. If a node receives a new level identifier, a new vertex is added to the DAG; otherwise, we continue to its parent node. Finally, new directed edges are added between the new vertices in the DAG, and the new leaf-pushed prefixes are added or updated in the off-chip hash tables. Hence, one update of the DAG requires at most O(d) memory accesses, where d is the depth of the leaf-pushed trie, achieving quick updates. Figure 9 illustrates an example of incremental updates of the DAG. A new prefix n with a next hop P is inserted into the forwarding table Table . We first expand the prefix n in the backup trie by leaf pushing, generating two leaf nodes corresponding to {n, P} and {n, P}, respectively. Thereafter, we check the level identifiers of all the nodes along the reverse paths from the two leaf nodes to the root node. As both leaf nodes have a new level identifier, their parent node has to alter its level identifier as well. Similarly, we repeat the process, altering the level identifiers of ancestor nodes up to the root node. Finally, two new vertices 9 and are added to the DAG, and new directed edges are added to connect them. In addition, vertex 6 is removed because it is no longer used by any lookup.

4. Experimental results

We have conducted simulation experiments to evaluate the trie merging scheme.
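The two cheap update kinds from Section 3.5 can be sketched against a flat hash table. This is an illustrative sketch: the backup trie is abstracted away (the sibling is computed directly from the leaf-pushed prefix string), and all prefixes and next hops are made up.

```python
# A sketch of the two cheap updates: a next-hop change rewrites one
# off-chip entry in place, and a deletion overwrites the entry with the
# sibling's (or the default) next hop, as read from the backup trie.
# Neither touches the on-chip DAG nor interrupts lookups.

def change_next_hop(table, prefix, new_nh):
    table[prefix] = new_nh                    # one off-chip write

def delete_prefix(table, prefix, default_nh='P0'):
    """Inherit the leaf-pushed sibling's next hop, or the default."""
    sibling = prefix[:-1] + ('0' if prefix[-1] == '1' else '1')
    table[prefix] = table.get(sibling, default_nh)

table = {'100': 'P1', '101': 'P4'}
change_next_hop(table, '101', 'P5')
delete_prefix(table, '100')                   # '100' now forwards like '101'
print(table)   # {'100': 'P5', '101': 'P5'}
```

Prefix insertion is the expensive case and goes through the reverse-path relabeling of the backup trie described above, at a cost of O(d) accesses.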
We implement an IP lookup simulator for virtual routers using C/C++, and run the simulations on a server with a .4 GHz 64-bit Intel Xeon E64 CPU and .GB of main memory. Since memory efficiency is more critical for scalable virtual routers, we only show results that examine the scalability of the memory used for IP lookup. These experiments compare our scheme with previous schemes, such as trie overlay and trie braiding, in terms of the number of nodes as well as the memory consumption. For evaluation purposes, we obtained six representative realistic forwarding tables of core BGP routers from two public databases. AS644, AS6, and AS are collected from http://bgp.potaroo.net, while V3 and V6 are collected from http://www.routeviews.org. AS644, AS6, OIX, and V3 are IPv4 forwarding tables that contain about 3K, K, 34K, and K prefixes, respectively. AS and V6 are IPv6 forwarding tables that contain about .K and .9K prefixes, respectively. Table shows the number of prefixes of these realistic IPv4 and IPv6 forwarding tables, and the number of nodes under various strides before leaf pushing.

[Table : Realistic IPv4 and IPv6 forwarding tables — number of prefixes, and number of nodes under various strides.]

Due to the lack of realistic virtual router forwarding tables, we randomly partition each forwarding table into m equal-sized sub-tables for m virtual router instances to conduct the performance evaluation. In addition, we use a toolkit called FRuG (Ganegedara et al., ) to generate large-scale synthetic IPv6 forwarding tables using an IPv4-to-IPv6 one-to-one mapping (Le et al., ). Table shows the synthetic IPv6 forwarding tables, each with about K prefixes, used in the experiments. In our experiments, a k-stride trie (k = 2 and 4) is used to represent each forwarding table in the virtual routers. Next, we show results on IPv4 and IPv6 forwarding tables.

4.1.
Experiments on IPv4 forwarding tables

We have conducted experiments on four realistic IPv4 forwarding tables, AS644, AS6, OIX, and V3, to examine the memory efficiency under various stride settings. Figure depicts the number of nodes on AS644 and AS6, while Fig. depicts the number of nodes on OIX and V3. From the figures, we can see that our scheme achieves significant reductions in the number of nodes of realistic IPv4 forwarding tables. This gain increases with the stride size, due to the fact that there are more isomorphic equivalent nodes among multiple tries. For instance, in 2-stride settings, the trie merging scheme reduces the number of nodes by . 9. times and .3 9. times compared to the trie overlay and trie braiding schemes, respectively. In 4-stride settings, the trie merging scheme reduces the number of nodes by 46. . times and 46. 9. times.

[Table : Synthetic IPv6 forwarding tables — number of prefixes, and number of nodes under 2-stride and 4-stride settings for varying numbers of tries.]

[Fig. . Number of nodes on realistic IPv4 tables: (a) AS644 and (b) AS6.]

[Fig. . Number of nodes on realistic IPv4 tables: (a) OIX and (b) V3.]

[Fig. . Memory consumption on realistic IPv4 tables: (a) AS644 and (b) AS6.]

[Fig. 3. Memory consumption on realistic IPv4 tables: (a) OIX and (b) V3.]

[Fig. 4. Number of nodes on realistic IPv6 tables: (a) AS and (b) V6.]

[Fig. . Memory consumption on realistic IPv6 tables: (a) AS and (b) V6.]

Figure depicts the memory consumption on AS644 and AS6, while Fig. 3 depicts the memory consumption on OIX and V3. Both figures show that our scheme requires less memory than previous schemes for different strides on realistic IPv4 forwarding tables. This gain also increases with the stride size. In 2-stride settings, the trie merging scheme reduces the memory consumption by 3. . times and .6 .3 times compared to the trie overlay and trie braiding schemes, respectively. In 4-stride settings, the trie merging scheme reduces the memory consumption by 4. . times and 4. . times.

4.2. Experiments on IPv6 forwarding tables

We have conducted experiments on two realistic IPv6 forwarding tables, AS and V6, and on synthetic IPv6 forwarding tables, to examine the memory efficiency under various stride settings. Figure 4 depicts the number of nodes on AS and V6. We can see that our scheme significantly reduces the number of nodes of realistic IPv6 forwarding tables for different strides. In 2-stride settings, the trie merging scheme reduces the number of nodes by . .3 times and . .9 times compared to the trie overlay and

trie braiding schemes, respectively. In 4-stride settings, the trie merging scheme reduces the number of nodes by 3.6 . times and . . times. The gain on IPv6 forwarding tables is smaller than that on IPv4 forwarding tables, because there are fewer isomorphic equivalent nodes among multiple IPv6 tries than among IPv4 tries.

[Fig. 6. Memory comparisons on synthetic IPv6 tables: (a) number of nodes, (b) memory consumption.]

Figure depicts the memory consumption on AS and V6, and shows that our scheme requires less memory than previous schemes on realistic IPv6 forwarding tables. In addition, the gain also increases with the stride size. For instance, in 2-stride settings, the trie merging scheme reduces the memory consumption by .6 . times and . 4.9 times compared to the trie overlay and trie braiding schemes, respectively. In 4-stride settings, the trie merging scheme reduces the memory consumption by 3. 6. times and 3. 6.4 times, respectively. Figure 6 depicts the memory comparisons on synthetic IPv6 forwarding tables. The figure also demonstrates that our scheme achieves significant reductions in both the number of nodes and the memory consumption. As shown in Fig. 6(a), in 2-stride settings, the trie merging scheme reduces the number of nodes by . . times compared to both previous schemes; in 3-stride settings, it reduces the number of nodes by . 3.4 times. Figure 6(b) also shows that our scheme requires less memory, and the gain increases with the stride size. In 2-stride settings, the trie merging scheme reduces the memory consumption by 3.
4.4 times and 3.3 4.6 times compared to the trie overlay and trie braiding schemes, respectively. In 3-stride settings, the trie merging scheme reduces the memory consumption by 3. .4 times and 3.4 .6 times, respectively.

5. Conclusion

This paper presents a trie merging approach to memory-efficient IP lookup for scalable virtual routers. The goal is to scale well to a large number of simultaneous forwarding tables by constructing a succinct shared data structure. The approach transforms a forest of multiple separate tries into an equivalent DAG by exploiting node isomorphism. We show that the DAG requires less memory while performing the same correct lookup. In addition, we propose an IP lookup architecture in which an on-chip DAG is searched to find the longest matching prefix, and the prefix is then used as a key to retrieve the corresponding next hop from off-chip hash tables. Experiments on realistic and synthetic IP forwarding tables demonstrate that the trie merging scheme achieves memory efficiency, and requires less memory than previous schemes. Results on IPv4 forwarding tables show that our scheme reduces the number of nodes by up to . times and up to 9. times, as well as the memory consumption by up to . times and up to . times, compared to the trie overlay and trie braiding schemes, respectively. Results on IPv6 forwarding tables show that our scheme reduces the number of nodes by up to . times and the memory consumption by up to 6.4 times compared to previous schemes.

Acknowledgment

This work was supported in part by the National Basic Research Program of China under Grant no. CB3, the National Science Foundation of China under Grant nos. 6 and 636, and the China Postdoctoral Science Foundation under Grant no. 4.

References

Anwer MB, Feamster N. Building a fast, virtualized data plane with programmable hardware. ACM SIGCOMM Comput Commun Rev ;4():.

Anwer MB, Motiwala M, Tariq M, Feamster N.
SwitchBlade: a platform for rapid deployment of network protocols on programmable hardware. In: ACM SIGCOMM; . p. 3 94.

Baboescu F, Tullsen DM, Rosu G, Singh S. A tree based router search engine architecture with single port memories. In: ISCA; . p. 3 33.

Bando M, Chao HJ. FlashTrie: hash-based prefix-compressed trie for IP route lookup beyond Gbps. In: IEEE INFOCOM; . p. 9.

Broder A, Mitzenmacher M. Using multiple hash functions to improve IP lookups. In: IEEE INFOCOM; . p. 44 63.

Chowdhury NMK, Boutaba R. A survey of network virtualization. Comput Netw ;4():6 6.

Degermark M, Brodnik A, Carlsson S, Pink S. Small forwarding tables for fast routing lookups. In: ACM SIGCOMM; 99. p. 3 4.

Dharmapurikar S, Krishnamurthy P, Taylor D. Longest prefix matching using Bloom filters. In: ACM SIGCOMM; 3. p. .

Eatherton W, Varghese G, Dittia Z. Tree bitmap: hardware/software IP lookups with incremental updates. SIGCOMM Comput Commun Rev 4;34():9.

Fu J, Rexford J. Efficient IP address lookup with a shared forwarding table for multiple virtual routers. In: ACM CoNEXT; .

Ganegedara T, Jiang W, Prasanna VK. FRuG: a benchmark for packet forwarding in future networks. In: IEEE IPCCC; .

Hasan J, Vijaykumar TN. Dynamic pipelining: making IP lookup truly scalable. In: ACM SIGCOMM; . p. 6.

Hasan J, Cadambi S, Jakkula V, Chakradhar S. Chisel: a storage-efficient collision-free hash-based network processing architecture. In: ISCA; 6. p. 3.

Huang K, Xie G, Li Y, Liu A. Offset address approach to memory-efficient IP address lookup. In: IEEE INFOCOM; . p. 36.

Jiang W, Prasanna VK. Beyond TCAMs: an SRAM-based multi-pipeline architecture for terabit IP lookup. In: IEEE INFOCOM; . p. 6 94.

Kumar S, Turner J, Crowley P. Peacock hashing: deterministic and updatable hashing for high performance networking. In: IEEE INFOCOM; . p. .

Le H, Ganegedara T, Prasanna VK. Memory-efficient and scalable virtual routers using FPGA. In: FPGA; . p. 66.

Lu G, Guo C, Li Y, Zhou Z, Yuan T, Wu H, et al. ServerSwitch: a programmable and high performance platform for data center networks. In: NSDI; .

Lu W, Sahni S. Low power TCAMs for very large forwarding tables. IEEE Trans Netw ;(3):94 9.

Luo L, Xie G, Uhlig S, Mathy L, Salamatian K, Xie Y. Towards TCAM-based scalable virtual routers. In: ACM CoNEXT; . p. 3 4.

Sherwood R, Gibb G, Yap K-K, Appenzeller G, Casado M, McKeown N, et al. Can the production network be the testbed? In: OSDI; .

Song H, Dharmapurikar S, Turner J, Lockwood J. Fast hash table lookup using extended Bloom filter: an aid to network processing. In: ACM SIGCOMM; . p. 9.

Song H, Hao F, Kodialam M, Lakshman TV. IPv6 lookups using distributed and load balanced Bloom filters for Gbps core router line cards. In: IEEE INFOCOM; 9. p. 6.

Song H, Kodialam M, Hao F, Lakshman TV. Scalable IP lookups using shape graphs. In: IEEE ICNP; 9. p. 3.

Song H, Kodialam M, Hao F, Lakshman TV. Building scalable virtual routers with trie braiding. In: IEEE INFOCOM; . p. 44.

Srinivasan V, Varghese G. Fast address lookups using controlled prefix expansion. In: ACM SIGMETRICS; 99. p. .

Waldvogel M, Turner J, Varghese G, Plattner B. Scalable high speed IP routing lookups. In: ACM SIGCOMM; 99. p. 36.

Yu H, Mahapatra R, Bhuyan L. A hash-based scalable IP lookup using Bloom and fingerprint filters. In: IEEE ICNP; 9. p. 64 3.

Zane F, Narlikar G, Basu A. CoolCAMs: power-efficient TCAMs for forwarding engines. In: IEEE INFOCOM; 3. p. 4.

BGP table. http://bgp.potaroo.net.

Route Views project. http://www.routeviews.org.

Zheng K, Hu C, Lu H, Liu B. A TCAM-based distributed parallel IP lookup scheme and performance analysis. IEEE/ACM Trans Netw 6;4(4):63.