On Adding Bloom Filters to Longest Prefix Matching Algorithms



Hyesook Lim, Member, IEEE, Kyuhee Lim, Nara Lee, and Kyong-hye Park, Student Members, IEEE

Abstract

High-speed IP address lookup is essential to achieve wire-speed packet forwarding in Internet routers. Ternary content addressable memory (TCAM) technology has been adopted to solve the IP address lookup problem because of its ability to perform fast parallel matching. However, TCAMs are difficult to apply widely because of their cost and power dissipation. Various algorithms and hardware architectures have been proposed to perform IP address lookup using ordinary memories such as SRAMs or DRAMs instead of TCAMs. Among these algorithms, we focus on two efficient algorithms providing high-speed IP address lookup: the parallel multiple-hashing algorithm and the binary search on levels algorithm. This paper shows how effectively an on-chip Bloom filter can improve those algorithms. A performance evaluation using actual backbone routing data with 15,000 to over 220,000 prefixes shows that, by adding a Bloom filter, the complicated hardware for parallel access is removed from the parallel multiple-hashing algorithm without a search performance penalty. In the binary search on levels algorithm, adding a Bloom filter improves search speed by 30-40%.

Index Terms: Internet, router, IP address lookup, longest prefix matching, Bloom filter, multi-hashing, binary search on levels, leaf pushing.

1 INTRODUCTION

Address lookup determines an output port using the destination IP address of an incoming packet. The address aggregation technology currently used for the Internet is a bitwise prefix matching scheme called classless inter-domain routing (CIDR), which uses variable-length subnet masking to allow arbitrary-length prefixes. An IP address is said to match a prefix if the most significant $l$ bits of the address and an $l$-bit prefix are the same.
When an IP address matches more than one prefix, the longest matching prefix is selected as the best matching prefix (BMP) [1]-[4].

IP address lookup is one of the most challenging operations in router design because of the amount of traffic and the number of networks, which have increased dramatically in recent years. Using application-specific integrated circuits (ASICs) with off-chip ternary content addressable memories (TCAMs) has been the best solution for providing wire-speed packet forwarding. However, TCAMs have some limitations [5]. TCAMs consume 150 times more power per bit than SRAMs. TCAMs consume around 30-40% of the total line card power. As line cards are stacked together, TCAMs impose a high cost on the cooling system. System vendors are willing to accept some latency penalty if the power of a line card can be lowered [6]. TCAMs also cost about 30 times more per bit of storage than DDR SRAMs.

(Manuscript received 5 Nov. 2011. The authors are with the Department of Electronics Engineering, Ewha W. University, Seoul, Korea. E-mail: hlim@ewha.ac.kr.)

Various algorithms have been studied to replace TCAMs with ordinary memories such as SRAMs or DRAMs [1]-[4], [6]-[26]. A fast on-chip SRAM is often used in several applications so that critical data is stored there with a guaranteed fast access time [27], since an access to off-chip memory (usually DRAM) is many times slower than an on-chip memory access. It is important to partition the data properly so that a small part is stored in on-chip memories and most of the data is stored in slower, higher-capacity off-chip memories.

Several metrics are used for evaluating the performance of IP address lookup algorithms and architectures. Since IP address lookup should be performed at wire-speed for every incoming packet, which can be a hundred million packets per second, search performance is the most important metric.
Search performance is often measured by the number of off-chip memory accesses. The next metric is the memory size required for storing a routing table. The incremental update of a routing table is also an important metric. Scalability for large routing data sets and migration to IPv6 should also be considered. The performance in these metrics depends on data structures and search algorithms, and thus it is essential to have an efficient structure and search algorithm to provide high-performance IP address lookup.

IP address lookup algorithms can be roughly categorized into trie (or tree)-based algorithms [2]-[4], [8], [13]-[26], hashing-based algorithms [9]-[11], and bitmap-based algorithms [6], [12]. Recently, dynamic programming-based approaches have been proposed to improve the search performance and/or storage performance [15]-[21].

Hashing is a well-defined procedure for turning each key into a smaller integer called a hash index, which serves as a pointer into an array. Hashing has been

used mostly in search algorithms to quickly locate a data record for a given search key. For IP address lookup, hashing is applied to each length of prefixes, and the longest prefix among the matched prefixes is selected as the best match [9]-[11]. Among trie-based algorithms, binary searching on hash tables organized by prefix lengths [13]-[15] provides the best IP address lookup performance.

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. Bloom filters have been widely applied to network algorithms [7], [26], [28]-[31]. This paper shows how effectively an on-chip Bloom filter can improve the search performance of known efficient IP address lookup algorithms.

This paper is organized as follows. Section 2 describes the Bloom filter theory. Section 3 introduces two different algorithms providing high-speed IP address lookup: parallel multiple-hashing and binary search on levels. Section 4 describes our proposed method to improve those algorithms using a Bloom filter. Section 5 shows performance evaluation results, and Section 6 concludes the paper.

2 BLOOM FILTER THEORY

A Bloom filter is basically a bit-vector used to represent the membership information of a set of elements. A Bloom filter that represents a set $S = \{x_1, x_2, \ldots, x_n\}$ of $n$ elements is described by an array of $m$ bits, initially all set to 0. A Bloom filter supports two different operations: programming and querying.

In programming, for an element $x$ in the set $S$, $k$ different hash functions are computed in such a way that each resulting hash index $h_i(x)$ is in the range $0 \le h_i(x) < m$ for $i = 1, \ldots, k$. Then all the bit-locations corresponding to the $k$ hash indices are set to 1 in the Bloom filter. The pseudo-code to program a Bloom filter for an element $x$ is as follows [7]:

    BFProgramming(x)
        for (i = 1 to k) BF[h_i(x)] = 1;

A query is performed to test whether an element $y \in S$.
For an input $y$, $k$ hash indices are generated using the same hash functions that were used to program the filter. The bit-locations in the Bloom filter corresponding to the hash indices are checked. If any one of the locations is 0, then $y$ is definitely not a member of the set $S$, and the result is termed a negative. If all the hash index locations are set to 1, then the input may be a member of the set, and the result is termed a positive. The querying procedure is as follows [7]:

    BFQuery(y)
        for (i = 1 to k)
            if (BF[h_i(y)] == 0) return negative;
        return positive;

However, a positive does not mean that all those bit-locations were set by the element being queried; they may have been set by other elements in the set. This type of positive result is termed a false positive. It is important to properly control the false positive rate in designing a Bloom filter. For a given ratio $m/n$, it is known that the false positive probability is minimized when the number of hash functions $k$ has the following relationship [7]:

    $k = \frac{m}{n} \ln 2$    (1)

On the whole, a Bloom filter may produce false positives but not false negatives.

3 RELATED WORKS

The IP address lookup problem can be defined formally as follows [8]. Let $P = \{P_1, P_2, \ldots, P_N\}$ be a set of routing prefixes, where $N$ is the number of prefixes. Let $A$ be an incoming IP address and $S(A, l)$ be the substring of the most significant $l$ bits of $A$. Let $n(P_i)$ be the length of a prefix $P_i$. $A$ is defined to match $P_i$ if $S(A, n(P_i)) = P_i$. Let $M(A)$ be the set of prefixes in $P$ that $A$ matches, that is, $M(A) = \{P_i \in P : S(A, n(P_i)) = P_i\}$. The longest prefix matching (LPM) problem is to find the prefix $P_j$ in $M(A)$ such that $n(P_j) > n(P_i)$ for all $P_i \in M(A)$, $i \ne j$. Once the longest matching prefix $P_j$ is determined, the input packet is forwarded to the output port directed by the prefix $P_j$.
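As a reference point for the definition above, the following is a minimal Python sketch of LPM by exhaustive scan. It only illustrates the definitions of $M(A)$ and $P_j$; the function names and the small prefix list are illustrative assumptions, and addresses and prefixes are represented as bit strings for readability.

```python
from typing import List, Optional


def matches(address: str, prefix: str) -> bool:
    """A matches P_i when S(A, n(P_i)) equals P_i."""
    return address[:len(prefix)] == prefix


def longest_prefix_match(address: str, prefixes: List[str]) -> Optional[str]:
    """Return the longest prefix in M(A), or None when M(A) is empty."""
    matched = [p for p in prefixes if matches(address, p)]  # the set M(A)
    return max(matched, key=len, default=None)


# Hypothetical prefix set and 6-bit addresses, for illustration only.
example_prefixes = ["1", "00", "010", "111", "1101", "11111"]
best = longest_prefix_match("110110", example_prefixes)
no_match = longest_prefix_match("011000", example_prefixes)
```

Every algorithm discussed in the following sections has to return the same answer as this scan; they differ only in how many memory accesses they need to find it.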
3.1 IP Address Lookup Algorithms Using Bloom Filters

The IP address lookup architecture proposed by Dharmapurikar et al. is the first algorithm employing a Bloom filter [7]. It performs parallel queries on $W$ Bloom filters, one per prefix length, to determine the possible lengths of a prefix match, where $W$ is 32 in IPv4. For a given IP address, off-chip hash tables are probed for the prefix lengths that turn out to be positive in the Bloom filters, starting from the longest prefix. This architecture has a high implementation complexity because it needs a Bloom filter as well as a hash table for each prefix length. Depending on the prefix distribution, the sizes of the Bloom filters and of the hash tables can be highly skewed. In order to bound the worst-case search performance by limiting the number of distinct prefix lengths, which is the same as the worst-case number of hash table probes, controlled prefix expansion (CPE) [22] is suggested in the paper. However, prefix replication is inevitable in CPE. Moreover, the naive hash table employed in this architecture incurs collisions, and resolving the collisions using chaining adversely affects the worst-case lookup-rate guarantees that routers should provide [30], [32].
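The sketch below restates the BFProgramming and BFQuery operations of Section 2 in Python and arranges one filter per prefix length, in the spirit of the architecture just described. It is a software illustration under stated assumptions, not the hardware design of [7]: the salted SHA-256 hashing, the filter size, the use of Python sets as hash tables, and all names are assumptions.

```python
import hashlib


class BloomFilter:
    """An m-bit vector with k hash indices per element."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, key: str):
        # Assumption: k indices come from salting one digest function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def program(self, key: str) -> None:      # BFProgramming(x)
        for idx in self._indices(key):
            self.bits[idx] = 1

    def query(self, key: str) -> bool:        # BFQuery(y); True means positive
        return all(self.bits[idx] for idx in self._indices(key))


def build(prefixes, m_per_length=64, k=3):
    """One Bloom filter and one hash table (a set here) per prefix length."""
    filters, tables = {}, {}
    for p in prefixes:
        if len(p) not in filters:
            filters[len(p)] = BloomFilter(m_per_length, k)
            tables[len(p)] = set()
        filters[len(p)].program(p)
        tables[len(p)].add(p)
    return filters, tables


def lookup(address, filters, tables):
    """Probe tables only for lengths whose filter is positive, longest first."""
    probes = 0
    for length in sorted(filters, reverse=True):
        candidate = address[:length]
        if filters[length].query(candidate):  # queried in parallel in hardware
            probes += 1                       # one hash table probe
            if candidate in tables[length]:
                return candidate, probes
    return None, probes


# Hypothetical prefix set, for illustration only.
filters, tables = build(["1", "00", "010", "111", "1101", "11111"])
best, probes = lookup("110110", filters, tables)
```

The loop in `lookup` visits the filters sequentially only because this is software; the point of the sketch is that a hash table is touched only after a positive, and that a false positive costs one wasted probe.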

3.2 Parallel Multiple-Hashing

A hash function is used to map the search key to a hash index. In general, a hash function may map several different keys to the same hash index. This is called a collision. Collisions are an intrinsic problem of hashing. Broder et al. proposed using multiple hash functions to reduce collisions [9]. Instead of searching for a perfect hash function in which each distinguishable search key is mapped to a different hash index, a multiple-hashing architecture [9] uses multiple hash functions for each search key. The number of hash tables is equal to the number of hash indices. Assuming two hash indices, the corresponding two hash tables are named the left table and the right table. Each slot of a hash table should contain a set of entries, and for this reason, each slot of a hash table is often called a bucket. In storing a given prefix, two hash indices are obtained; the hash index from hash function 1 is used to access the left table, and the index from hash function 2 is used to access the right table. The loads of the two accessed buckets, that is, the numbers of entries already stored in them, are compared, and the prefix is stored in the bucket with the smaller load. With multiple-hashing, prefixes are distributed more evenly into the hash tables. The number of collisions can be controlled by three parameters: the number of hash tables, the number of buckets in a table, and the number of entries in a bucket.

To apply multiple-hashing to an IP address lookup problem with variable-length prefixes, a parallel multiple-hashing (PMH) architecture [10] constructs a separate multi-hash table for each group of prefixes with a distinct length and, additionally, an overflow TCAM. A prefix is stored into the overflow TCAM when both buckets are already full. Figure 1 shows the overall PMH architecture.
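The storing rule described above (two candidate buckets, choose the one with the smaller load, fall back to the overflow store when both are full) can be sketched as follows. This is a minimal illustration for a single table pair; the class name, the use of CRC-32 bit ranges as the two hash functions, and the Python list standing in for the overflow TCAM are assumptions.

```python
import zlib


class MultiHashTable:
    """Left and right tables of fixed-size buckets plus an overflow store."""

    def __init__(self, num_buckets: int, entries_per_bucket: int):
        self.num_buckets = num_buckets
        self.capacity = entries_per_bucket
        self.left = [[] for _ in range(num_buckets)]
        self.right = [[] for _ in range(num_buckets)]
        self.overflow = []  # stands in for the overflow TCAM

    def _indices(self, key: str):
        # Assumption: both indices are taken from one CRC-32 code.
        code = zlib.crc32(key.encode())
        return code % self.num_buckets, (code >> 16) % self.num_buckets

    def insert(self, key: str) -> str:
        i, j = self._indices(key)
        left_bucket, right_bucket = self.left[i], self.right[j]
        # Choose the bucket with the smaller load; ties go to the left table.
        if len(left_bucket) <= len(right_bucket):
            target, name = left_bucket, "left"
        else:
            target, name = right_bucket, "right"
        if len(target) >= self.capacity:
            # The less loaded bucket is full, so both buckets are full.
            self.overflow.append(key)
            return "overflow"
        target.append(key)
        return name

    def contains(self, key: str) -> bool:
        i, j = self._indices(key)
        return key in self.left[i] or key in self.right[j] or key in self.overflow


# Deliberately overfilled: 40 keys into 2 tables x 8 buckets x 2 entries.
table = MultiHashTable(num_buckets=8, entries_per_bucket=2)
keys = [format(v, "06b") for v in range(40)]
placements = [table.insert(k) for k in keys]
```

With 32 slots and 40 keys, at least 8 keys necessarily land in the overflow store, which shows why the three sizing parameters named in the text determine how often the TCAM is needed.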
Multiple hash tables (here two) are constructed for each length, and the prefixes of each length are stored into either a left table entry or a right table entry of the corresponding length. The search procedure is as follows. For a given input address, hash indices for all possible lengths are obtained. Using these hash indices, the multi-hashing tables are accessed in parallel, and the matching prefixes in each length (if they exist) are returned. The overflow TCAM is also assumed to be accessed in parallel. Among the returned prefixes, the longest matching prefix is selected by the priority encoder. By the parallel access of the multi-hashing tables in every length, the best matching prefix (BMP) is obtained in a single access cycle. However, since the tables of each length should be implemented using a separate memory for parallel access and the size of the tables can be highly skewed depending on the prefix distribution, the implementation complexity can become very high.

Fig. 1. Parallel multiple-hashing architecture.

3.3 Binary Search on Trie Levels by Waldvogel

A binary trie is a tree-based data structure which applies linear search on length [4]. Each prefix resides in a node of the trie, in which the level and the path of the node from the root node correspond to the prefix length and the value, respectively. Figure 2 shows the binary trie for an example set of prefixes P = {1, 00, 010, 111, 1101, 11111, ...}. In Fig. 2, black nodes represent prefixes, and white nodes represent empty internal nodes. At each node, the search proceeds to the left or right according to sequential inspection of the address bits, starting from the most significant bit. If the bit is 0, the search proceeds to the left child; otherwise it proceeds to the right child, until it reaches a leaf node. The binary trie structure is simple and easy to implement. However, the search performance of the binary trie is linearly related to the length of the IP address, since the bits are examined one at a time.

Fig. 2. The binary trie for an example set of prefixes.

As an attempt to improve the search performance of the trie, algorithms performing binary search on trie levels have been proposed [13], [14]. The binary search on levels structure proposed by Waldvogel et al. [13] separates the binary trie according to the levels of the trie and stores the nodes included in each level in a hash table. Binary search is performed on the hash tables of the levels. When accessing the medium-level hash table, if there is a match, the search space becomes the longer half;

otherwise, the search space becomes the shorter half. Figure 3 shows Waldvogel's binary search on length (W-BSL) structure with the denotation of access levels [13]. Level 3 is the first level of access, levels 1 and 5 are the second levels of access, and levels 2, 4, and 6 are the last levels of access. The W-BSL structure uses pre-computed markers and BMPs. Markers are pre-computed in the internal nodes if there is a longer prefix in the levels accessed later. A pre-computed BMP is maintained at each marker, and the pre-computed BMP is returned when there is no match in longer levels. The markers and the BMPs are not maintained for the last level of access. They are pre-computed for nodes of the preceding levels of access, as shown in Fig. 3.

Fig. 3. Waldvogel's binary search on lengths (W-BSL).

As a search example, for a 6-bit input beginning with 11100, the most significant 3 bits, which are 111, are used for hashing in accessing level 3. Since the input matches P5, it is remembered as the current BMP, and the search goes to level 5. In level 5, the most significant 5 bits of the input, 11100, do not match any node. The search goes to level 4 and does not match. Since it is the last level of access, the search is over and prefix P5 is returned as the BMP. Using the binary search on trie levels, three memory accesses were performed to find the longest matching prefix for this input.

Waldvogel's algorithm provides $O(\log l_{dist})$ hash-table accesses, where $l_{dist}$ is the number of distinct prefix lengths. Srinivasan and Varghese have proposed to improve the search performance by using controlled prefix expansion to reduce the value of $l_{dist}$ [22]. Kim and Sahni have proposed to optimize the storage requirement by selecting the prefix lengths that minimize the number of markers and pre-computed BMPs when $l_{dist}$ is given [15].

3.4 Binary Search on Trie Levels in a Leaf-Pushed Trie

W-BSL requires complex pre-computation because of the prefix nesting relationship, in which one prefix becomes a substring of another prefix. That is, since each node can have prefixes in both upper and lower levels, the markers and their BMPs should be pre-computed. If all prefixes are disjoint from each other, they are located only in leaves and are free from a prefix nesting relationship. Hence, the binary search on trie levels for a set of disjoint prefixes can be performed without pre-computation.

Lim's binary search on level structure (L-BSL) [14] uses leaf-pushing [22] to make every prefix disjoint. Figure 4 shows the leaf-pushed binary trie for the same set of prefixes. Leaf-pushed nodes are connected to the trie by dotted edges. The levels of access for performing the binary search on levels are also shown. Assume a 6-bit input beginning with 1101. Note that we use a different input example from the W-BSL to show the search procedure for the L-BSL in detail. Since the first level of access is 4, the most significant 4 bits, which are 1101, are used for hashing. As the search encounters an internal node, it proceeds to a longer level. In level 6, the most significant 6 bits of the input do not match any node. Hence the search proceeds to a shorter level, which is level 5. The input matches prefix P4 in level 5. The prefix P4 is returned and the search is over. The L-BSL finishes a search either when a match to a prefix occurs or when it reaches the last level of access, while the W-BSL always finishes a search when it reaches the last level of access.

Fig. 4. Lim's binary search on lengths (L-BSL).

4 THE PROPOSED ARCHITECTURES

4.1 Adding a Bloom Filter to Parallel Multiple-Hashing Architecture (PMH-BF)

In this subsection, we propose to add an on-chip Bloom filter to the PMH architecture in order to reduce the implementation complexity.
In implementing hash tables, previous architectures, such as [7], [9], [10], [11], [13] and [32], require separate hash tables for each length, and this increases the implementation complexity by requiring multiple variable-sized memories. To reduce the required number of memories by reducing the distinct number of prefix lengths, [7] suggested using CPE. However, CPE causes prefix replications and increases the

memory requirement. In [32], prefix collapsing is suggested. However, prefix collapsing increases the number of collisions in hashing. On the whole, in using hashing for IP address lookup, it is essential to reduce the required number of off-chip memories. If there is no need for parallel access in each prefix length, the hash table can be designed to accommodate prefixes of various lengths within a single table. We describe the proposed architecture using the same example set.

We use a single hash function based on a cyclic redundancy check (CRC) generator to obtain multiple hash indices for prefixes of every length [29]. The CRC generator is composed of shift-right registers with XOR logic, and hence it is easy to implement. A further advantage of using a CRC generator as a hash generator is that hash indices can be obtained consistently for prefixes of various lengths. Figure 5 shows an example of an 8-bit CRC generator. All the registers of the CRC generator are initially set to 0. Once a prefix of arbitrary length is serially entered into the CRC generator and XORed with the stored register values, a fixed-length scrambled code is obtained. By selecting a set of registers, or multiple sets of registers, from the scrambled code, we obtain as many hash indices as desired, of any length.

Fig. 5. CRC-8 generator.

In Fig. 6, we show the overall structure of our proposed architecture. The on-chip Bloom filter was programmed using the BF indices in TABLE 1. The multi-hashing table, which is composed of two hash tables, has 8 buckets and 2 entries per bucket for storing the prefixes. Prefixes were stored into the multi-hashing table using the indices shown in TABLE 1. In this example, there is no overflow.

Fig. 6. Our proposed multiple hashing architecture (PMH-BF).

Let $n$, $m$, and $k$ be the number of prefixes, the number of bits in a Bloom filter, and the number of hash indices, respectively. As an example case, we set the Bloom filter size $m$ as 16.
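A bit-serial CRC register and the selection of index fields from its output can be sketched as below. This is only an illustration of the idea: the generator polynomial (0x07), the shift direction, and the function names are assumptions, because the text does not specify the generator of Fig. 5, so the codes produced here are not expected to match the paper's tables.

```python
def crc8(bits: str, poly: int = 0x07) -> int:
    """Feed a bit string of any length through an 8-bit register that starts at 0."""
    reg = 0
    for b in bits:
        feedback = ((reg >> 7) & 1) ^ int(b)  # XOR input bit with the MSB
        reg = (reg << 1) & 0xFF
        if feedback:
            reg ^= poly                        # assumed generator polynomial
    return reg


def extract_indices(code: int, width: int):
    """Return the first `width` bits and the last `width` bits of an 8-bit code."""
    first = code >> (8 - width)
    last = code & ((1 << width) - 1)
    return first, last


# A hypothetical 8-bit code, used only to show the field selection.
example_code = 0b10010100
bf_indices = extract_indices(example_code, 4)     # two 4-bit indices for m = 16
table_indices = extract_indices(example_code, 3)  # two 3-bit indices for 8 buckets

# A code computed from a variable-length input with the assumed generator.
code_for_prefix = crc8("111")
```

Because the register always holds 8 bits regardless of how many input bits were shifted in, the same extraction works for every prefix length, which is the property the text relies on.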
The number of hash indices $k$ should be derived from Eq. (1), and here we set $k$ as 2. Since $m$ is 16, we need 4 bits for each hash index. We arbitrarily select the first 4 bits and the last 4 bits from the 8-bit CRC codes as Bloom filter indices. Assuming an 8-bucket multi-hashing table, we can also obtain the hash indices for the multi-hashing table from the CRC code. We arbitrarily select the first 3 bits and the last 3 bits from the 8-bit CRC codes as the multi-hashing indices. TABLE 1 shows the CRC codes for the example set of prefixes and the selected indices.

TABLE 1
CRC code and Bloom filter index for each prefix

    Prefix    CRC Code    BF Indices    Hash Table Indices
    1*        ...         ..., 11       5, 3
    00*       ...         ..., 0        0, 0
    010*      ...         ..., 5        6, 5
    111*      ...         9, 4          4, ...
    1101*     ...         3, 4          1, ...
    11111*    ...         ..., 11       2, ...
    ...*      ...         ..., 6        5, 6

The search procedure for the proposed algorithm is summarized in the following pseudo-code. Let $A$ be the destination address of a given input packet, and $S(A, l)$ be the sub-string of the most significant $l$ bits of $A$. Let $n(x)$ be the length of an element $x$.

    SearchMHB(A) {
        TCAM_BMP = TCAM_search(A);
        if (TCAM_BMP != NULL) len = n(TCAM_BMP);
        else len = 0;
        for (l = W to len+1) {
            if (valid[l] == 1) {    // l is a valid length
                instring = S(A, l);
                CRC = crc_gen(instring);
                for (i = 1 to k) bf_idx[i] = extract(CRC);
                rst = probe_BF(bf_idx[1], ..., bf_idx[k]);
                if (rst == positive) {
                    for (i = 1 to 2) h_idx[i] = extract(CRC);
                    BMP = hash_table(instring, h_idx[1], h_idx[2]);
                    if (BMP != NULL) return BMP;
                }
            }
        }
        return TCAM_BMP;
    }

In this example set, the valid levels are lengths 6, 5, 4, 3, 2, and 1. As a search example, consider a 6-bit input address

beginning with 11100. For the full 6-bit input, the 8-bit CRC generator shown in Fig. 5 produces a code from which, by selecting the first 4 bits and the last 4 bits, we have BF indices 9 and 2. The Bloom filter shown in Fig. 6 produces a negative since one of the Bloom filter bits is 0, and hence the hash table is not accessed. Reducing the input by 1 bit, 11100 is entered. The BF indices of the generated code are 2 and 5, the Bloom filter produces a negative, and hence the hash table is not accessed. Next, the input 1110 is tried. The BF indices of the generated CRC code are 4 and 10. The Bloom filter produces a positive, and the hash table is accessed using the two hash indices obtained from the first 3 bits and the last 3 bits of the CRC code, which are 2 and 2. There is no match for 1110 in bucket 2 of the left table or in bucket 2 of the right table, so it turns out that the Bloom filter produced a false positive. Next, the input 111 is tried. The BF indices of the generated CRC code are 9 and 4. The Bloom filter produces a positive, and the hash table is accessed. We obtain a matched prefix in bucket 4 of the left table. Therefore, the search is over.

The search for a given input is terminated when a true positive occurs. In this example, the Bloom filter was accessed 4 times, for lengths 6, 5, 4, and 3, and it generated 2 negatives, 1 false positive, and 1 true positive. The number of hash table accesses is 2, which is the same as the number of Bloom filter positives. As will be shown in the simulation section, the false positive rate can be reduced to less than 0.3% by increasing the size of the Bloom filter to 16 times the number of prefixes.

Parallel access to the hash tables of each length is not necessary in the proposed architecture, since the Bloom filter filters out the lengths of the input that do not have a matching prefix. The Bloom filter is small enough to be implemented in a fast cache or embedded in a chip.
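The search walk-through above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses CRC-32 from `zlib` rather than the CRC-8 generator of Fig. 5, a 128-bit filter with three indices, a Python set in place of the overflow TCAM, and a hypothetical prefix set; the hash key pads each string with zeros and appends its length, following the key format described later in the evaluation section.

```python
import zlib

W = 6  # address width of the toy example (32 for IPv4)


def crc_code(bits: str) -> int:
    # Key: string zero-padded to W bits, followed by a 6-bit length field.
    key = bits.ljust(W, "0") + format(len(bits), "06b")
    return zlib.crc32(key.encode())


class PMHBF:
    def __init__(self, prefixes, bf_bits=128, k=3, buckets=8, entries=2):
        self.bf = [0] * bf_bits
        self.k, self.buckets, self.entries = k, buckets, entries
        self.left = [[] for _ in range(buckets)]
        self.right = [[] for _ in range(buckets)]
        self.overflow = set()  # stands in for the overflow TCAM
        self.valid = sorted({len(p) for p in prefixes}, reverse=True)
        for p in prefixes:
            self._store(p)

    def _bf_indices(self, code):
        # k index fields selected from one CRC code
        return [(code >> (5 * i)) % len(self.bf) for i in range(self.k)]

    def _buckets_for(self, code):
        return self.left[code % self.buckets], self.right[(code >> 16) % self.buckets]

    def _store(self, p):
        code = crc_code(p)
        l, r = self._buckets_for(code)
        target = l if len(l) <= len(r) else r
        if len(target) < self.entries:
            target.append(p)
            for i in self._bf_indices(code):
                self.bf[i] = 1
        else:
            self.overflow.add(p)

    def search(self, address):
        """Return (best matching prefix, number of hash table accesses)."""
        tcam_bmp = max((p for p in self.overflow if address.startswith(p)),
                       key=len, default=None)
        floor = len(tcam_bmp) if tcam_bmp else 0
        accesses = 0
        for length in self.valid:              # longest valid length first
            if length <= floor:
                break
            s = address[:length]
            code = crc_code(s)
            if all(self.bf[i] for i in self._bf_indices(code)):  # BF positive
                accesses += 1
                l, r = self._buckets_for(code)
                if s in l or s in r:
                    return s, accesses         # true positive ends the search
        return tcam_bmp, accesses


engine = PMHBF(["1", "00", "010", "111", "1101", "11111"])
bmp, table_accesses = engine.search("111000")
```

A single table pair serves every prefix length here; the Bloom filter decides, length by length, whether that table is worth accessing.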
Hence the implementation complexity is significantly reduced by adding a simple Bloom filter, without sacrificing the search performance.

4.2 Adding a Bloom Filter to Binary Search on Trie Levels by Waldvogel (WBSL-BF)

In this subsection, we propose to add an on-chip Bloom filter to Waldvogel's binary search on level algorithm in order to improve the search performance. The preliminary version of this proposal was presented in [26]. Here, the proposed algorithm is described in detail in the context of adding a Bloom filter. The role of the Bloom filter is to filter out the substrings of each input that do not have a node in the binary trie. TABLE 2 shows the CRC codes and Bloom filter indices for every node in the levels of access (level 1 through 6) for the W-BSL trie shown in Fig. 3. We use the CRC-8 generator of Fig. 5 to obtain the BF indices. Figure 7 shows the WBSL-BF trie, which has a Bloom filter programmed using the Bloom filter indices.

TABLE 2
CRC codes and Bloom filter indices for WBSL-BF

    Prefix    CRC Code    BF Indices
    0*        ...         ..., 0
    1*        ...         ..., 11
    00*       ...         ..., 0
    01*       ...         ..., 11
    11*       ...         ...
    ...*      ...         ..., 5
    110*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ..., 6

Fig. 7. Adding a Bloom filter to W-BSL (WBSL-BF).

The search procedure for the WBSL-BF is summarized in the following pseudo-code.

    SearchWBSL(A) {
        TCAM_BMP = TCAM_search(A);
        low = min_level; high = max_level;
        while (low <= high) {
            acclevel = floor((low + high) / 2);
            instring = S(A, acclevel);
            CRC = crc_gen(instring);
            for (i = 1 to k) bf_idx[i] = extract(CRC);
            rst = probe_BF(bf_idx[1], ..., bf_idx[k]);
            if (rst == negative) high = acclevel - 1;
            else {    // positive
                h_idx = extract(CRC);
                node = access_hash_table(instring, h_idx);
                if (node == NULL) high = acclevel - 1;
                else {    // a node exists
                    low = acclevel + 1;
                    if (prefix node or bmp node) tree_BMP = node.bmp;
                }
            }
        }
        if (n(TCAM_BMP) < n(tree_BMP)) return tree_BMP;
        else return TCAM_BMP;
    }
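The following Python sketch shows one way the WBSL-BF procedure could be exercised end to end: it builds per-level hash tables with pre-computed markers and BMPs, programs a Bloom filter with every node, and counts the hash table accesses during the binary search. The CRC-32 based indices, the filter size, the dictionary tables, the omission of the overflow TCAM, and the prefix set are all assumptions for illustration.

```python
import zlib


def build_wbsl(prefixes):
    """Per-level tables mapping node string -> pre-computed BMP (or None)."""
    levels = sorted({len(p) for p in prefixes})
    tables = {l: {} for l in levels}
    pset = set(prefixes)

    def bmp_of(s):  # longest prefix in the set that is a prefix of s
        return max((s[:n] for n in range(1, len(s) + 1) if s[:n] in pset),
                   key=len, default=None)

    for p in prefixes:
        lo, hi = 0, len(levels) - 1
        while lo <= hi:                 # replay the binary search toward len(p)
            mid = (lo + hi) // 2
            level = levels[mid]
            if level > len(p):
                hi = mid - 1
            else:
                tables[level][p[:level]] = bmp_of(p[:level])  # marker or prefix
                if level == len(p):
                    break
                lo = mid + 1
    return levels, tables


def bf_indices(s, m, k):
    code = zlib.crc32(s.encode())       # assumed hash; the text uses a CRC generator
    return [(code >> (4 * i)) % m for i in range(k)]


def program_bf(tables, m=64, k=2):
    bf = [0] * m
    for table in tables.values():
        for s in table:
            for i in bf_indices(s, m, k):
                bf[i] = 1
    return bf


def search_wbsl_bf(address, levels, tables, bf, k=2):
    """Return (BMP, hash table accesses); a negative skips the table access."""
    lo, hi, bmp, accesses = 0, len(levels) - 1, None, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        s = address[:levels[mid]]
        if not all(bf[i] for i in bf_indices(s, len(bf), k)):
            hi = mid - 1                # negative: no node, go shorter
            continue
        accesses += 1                   # positive: probe the off-chip table
        node_bmp = tables[levels[mid]].get(s, "absent")
        if node_bmp == "absent":
            hi = mid - 1                # false positive: go shorter
        else:
            if node_bmp is not None:
                bmp = node_bmp
            lo = mid + 1                # node exists: go longer
    return bmp, accesses


levels, tables = build_wbsl(["1", "00", "010", "111", "1101", "11111"])
bf = program_bf(tables)
result, accesses = search_wbsl_bf("111000", levels, tables, bf)
```

Since a Bloom filter never produces a false negative, treating a negative as "no node at this level" cannot change the search result; it only removes table accesses that would have missed anyway.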

As a search example for the same input used in the W-BSL example, the CRC code generated for the 3-bit substring 111 gives BF indices 9 and 4. The Bloom filter shown in Fig. 7 produces a positive, and hence the hash table is accessed and P5 is obtained as the current best match. For the 5-bit substring of the input, 11100, the BF indices are 2 and 5, and the result is a negative. Hence the hash table is not accessed, and the search space becomes the shorter lengths. For the 4-bit substring of the input, 1110, the BF indices are 4 and 10, and the result is a positive. Hence the hash table is accessed. There is no match, so it turns out to be a false positive. P5 is returned as the BMP. Compared with the search procedure in W-BSL, the number of off-chip hash table accesses is reduced from 3 to 2 because the Bloom filter produced one negative.

4.3 Adding a Bloom Filter to Binary Search on Trie Levels in a Leaf-Pushed Trie (LBSL-BF)

TABLE 3 shows the CRC codes and Bloom filter indices for every node in the levels of access (level 2 through 6) for the L-BSL trie shown in Fig. 4. Figure 8 shows the LBSL-BF trie, which has a Bloom filter programmed using the Bloom filter indices.

TABLE 3
CRC codes and Bloom filter indices for LBSL-BF

    Prefix    CRC Code    BF Indices
    00*       ...         ..., 0
    01*       ...         ..., 11
    10*       ...         ..., 5
    11*       ...         ...
    ...*      ...         ..., 5
    110*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ...
    ...*      ...         ..., 6

Fig. 8. Adding a Bloom filter to L-BSL (LBSL-BF).

The search procedure for the LBSL-BF is summarized in the following pseudo-code.

    SearchLBSL(A) {
        TCAM_BMP = TCAM_search(A);
        low = min_level; high = max_level;
        while (low <= high) {
            acclevel = ceil((low + high) / 2);
            instring = S(A, acclevel);
            CRC = crc_gen(instring);
            for (i = 1 to k) bf_idx[i] = extract(CRC);
            rst = probe_BF(bf_idx[1], ..., bf_idx[k]);
            if (rst == negative) high = acclevel - 1;
            else {    // positive
                h_idx = extract(CRC);
                node = access_hash_table(instring, h_idx);
                if (node == NULL) high = acclevel - 1;
                else if (node.type == internal) low = acclevel + 1;
                else {    // prefix node
                    tree_BMP = node.bmp;
                    if (n(TCAM_BMP) < n(tree_BMP)) return tree_BMP;
                    else return TCAM_BMP;
                }
            }
        }
        return TCAM_BMP;
    }

For an example input beginning with 11011, at level 4, the CRC code of the 4-bit substring 1101 gives BF indices 3 and 4. The Bloom filter produces a positive, and hence the hash table is accessed and the search encounters an internal node. For the 6-bit string, the BF indices are 13 and 8. The result is a negative, and hence the hash table is not accessed. For the 5-bit substring 11011, the BF indices are 11 and 1, and the result is a positive. The hash table is accessed and the search encounters P4 stored in the hash table; P4 is returned as the BMP. Compared with the search procedure in L-BSL, the number of off-chip hash table accesses is reduced from 3 to 2 because of one Bloom filter negative. (In determining the sequence of access levels, we can use either a flooring operation or a ceiling operation. The ceiling operation is used in this example.)

5 PERFORMANCE EVALUATION

5.1 Search Performance Improvement by Adding a Bloom Filter to Multiple-Hashing Architecture

A performance evaluation was executed using 5 different sets of actual backbone routing data [35] with the C++

language. Throughout our simulation, hash indices were consistently generated using a 32-bit CRC generator [29]. The number of overflows depends on several factors, such as the number of hash tables, the number of buckets in each hash table, and the number of entries in each bucket. For $N$ prefixes, we have two hash tables, each of which has $N' = 2^{\lceil \log_2 N \rceil}$ buckets. Each bucket has two entries, and each entry has 46 bits. For the 5 different prefix sets, there is no overflow except for one overflow that occurred in PORT80. We assume that overflow prefixes are stored in a TCAM. Figure 9 shows the entry structure of the multi-hashing table. Hence, the memory requirement is $4 \times 46 \times N'$ bits.

Fig. 9. Entry structure of multi-hashing table.

From our simulation, we found that the average number of hash table accesses ($H_{avg}$) is not as low as we expected because of the false positives of the Bloom filter. Unexpected false positives are caused by prefixes having the same bit patterns but different lengths. To eliminate these false positives, the hash key of the Bloom filter should carry more information than the prefix value itself. We padded zeros after each prefix to make every prefix 32 bits long and attached 6 bits of prefix length information after that. Each hash key now has 38 bits.

The performance evaluation result for the sets of prefixes is shown in TABLE 4. The number of prefixes in each routing set is shown inside the parentheses. The number of input traces generated for simulation is 3 times the number of prefixes. The size of the Bloom filter $M$ is set proportional to $N'$. The number of hash indices, $K$, is calculated using Eq. (1), and the result is 2, 2, 3, 6, and 11, respectively, for $N'$, $2N'$, $4N'$, $8N'$, and $16N'$. An input IP address is probed only for the distinct lengths that exist in the routing data, starting from the longest length. If a positive is returned from the Bloom filter, the hash table is probed. If it has a match in the hash table, the search is over.
If it does not have a match, the same procedure is repeated for the next longest length, and so on. The maximum number of Bloom filter queries ($Q_{max}$) can be up to the number of distinct lengths that exist in the routing data. Since the Bloom filter querying stops when the longest matching prefix is found, the average number of Bloom filter queries ($Q_{avg}$) is smaller than the maximum. As the size of the Bloom filter increases, the maximum number of hash table probes ($H_{max}$) and the average number of hash table probes ($H_{avg}$) decrease as expected, since the number of negatives from the Bloom filter increases and the number of false positives decreases. The hash table access rate in the last column is the ratio of the average number of hash table probes to the average number of Bloom filter queries, i.e., $H_{avg}/Q_{avg}$. As the size of the Bloom filter increases from $N'$ to $16N'$, the hash table access rate is exponentially reduced. This means that the Bloom filter avoids unnecessary accesses to the off-chip hash table more effectively as its size increases.

When the size of the Bloom filter is $16N'$, the maximum number of hash table probes ($H_{max}$) is bigger than one, but the average number of hash table accesses ($H_{avg}$) is only 1. For a given input trace, if the Bloom filter has false positives, the number of hash table probes becomes more than one. For the Bloom filter size $16N'$, TABLE 5 shows the total number of input traces injected into our simulation and the number of inputs that have at least one false positive. As shown, the number of inputs with at least one false positive is very small compared with the total number of input traces. Hence the fractional part of the average number of hash table accesses ($H_{avg}$) is zero up to the second digit after the decimal point. This means that most of the false positives are removed when the size of the Bloom filter is sufficiently large, and each IP address lookup requires only one off-chip hash table access on average.
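To relate the trend described above to Eq. (1), the sketch below evaluates the real-valued optimum $k = (m/n) \ln 2$ and the standard approximation of the false positive probability, $f = (1 - e^{-kn/m})^k$, for filter sizes of 1 to 16 bits per element. The approximation formula is textbook Bloom filter theory rather than something stated in this paper, and treating $m/n$ as exactly 1, 2, 4, 8, and 16 is an assumption, since the evaluation sizes the filter from $N'$ rather than $N$. Plain rounding gives $k = 1$ for the two smallest ratios, whereas the text reports 2; the sketch does not try to reproduce that choice.

```python
import math


def optimal_k(m_over_n: float) -> float:
    """Eq. (1): the real-valued k minimizing the false positive probability."""
    return m_over_n * math.log(2)


def false_positive_rate(m_over_n: float, k: int) -> float:
    """Standard approximation f = (1 - exp(-k n / m))^k."""
    return (1.0 - math.exp(-k / m_over_n)) ** k


rows = []
for ratio in (1, 2, 4, 8, 16):
    k = max(1, round(optimal_k(ratio)))   # at least one hash index
    rows.append((ratio, k, false_positive_rate(ratio, k)))

for ratio, k, f in rows:
    print(f"m/n = {ratio:2d}   k = {k:2d}   f = {f:.5f}")
```

Each doubling of the filter size roughly squares the false positive probability once $k$ tracks the optimum, which is consistent with the exponential reduction in the hash table access rate noted in the text.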
Hence the complex hardware for parallel access and the multiple separate memories required in the previous architecture are effectively removed by adding a simple on-chip Bloom filter of up to 512 Kbytes for about 200K prefixes.

TABLE 6 compares the performance of the PMH algorithm with and without a Bloom filter. Without a Bloom filter, the average number of off-chip memory accesses for an IP address lookup using the PMH algorithm is 6.77 to 11.96, and the maximum number is 22 to 30. In our PMH-BF algorithm, the size of the Bloom filter is 32 Kbytes to 512 Kbytes, which is $16N'$ bits, where $N' = 2^{\lceil \log_2 N \rceil}$ for $N$ prefixes. For each given input, off-chip hash table accesses are avoided for every length for which the Bloom filter query is negative. A negative means that there is no prefix of that specific length matching the input. The average number of off-chip memory accesses for an IP address lookup becomes 1.00 in every case, and the maximum number is 2 or 3.

Simulations have been performed to compare the search performance of our proposed PMH-BF algorithm with that of Dharmapurikar's algorithm (D-BF) [7]. The D-BF algorithm applies controlled prefix expansion to bound the worst-case search performance to 3 hash table accesses. A direct lookup array is used for prefixes of length less than or equal to 20 bits. Prefixes of length 21 to 23 are expanded to length 24, and prefixes of length 25 to 31 are expanded to length 32. The D-BF algorithm requires two Bloom filters: one for prefixes of length 24 and the other for prefixes of length 32. For a fair comparison, we implemented both algorithms under the same constraints in terms of the memory amount for Bloom filters, the memory amount for the hash table implementation, the hash functions, and the handling of collided hash keys.

TABLE 4. Performance evaluation results of PMH-BF.
Columns: Routing Data ($N$); Bloom filter: $N'$, $M$, Size (Kbyte), $K$, $Q_{max}$, $Q_{avg}$; Multi-hashing table: No. of Buckets, Size (Mbyte), $H_{max}$, $H_{avg}$; Hash Table Access Rate.
Rows: MAE-WEST (14553), MAE-EAST (39464), PORT80 (112310), Grouptlcom (170601), Telstra (227223), each for Bloom filter sizes $N'$, $2N'$, $4N'$, $8N'$, and $16N'$.

TABLE 5. Total number of input traces and the number of inputs with at least one false positive.
Columns: MAE-WEST, MAE-EAST, PORT80, Grouptlcom, Telstra.
Rows: No. of inputs; No. of inputs with false positives; Rate.

TABLE 6. Comparison in search performance with and without Bloom filter for PMH.
Columns: Prefix Set; Without Bloom filter: Avg. Memory Access, Max. Memory Access; With Bloom filter (PMH-BF): BF Size (Kbyte), Avg. Memory Access, Max. Memory Access.
Rows: MAE-WEST, MAE-EAST, PORT80, Grouptlcom, Telstra.

In generating hash indices for a Bloom filter, the ANSI C function rand() was used, as suggested in [7], for both algorithms. Each hash key used as an input to rand() is at most 32 bits long. In generating hash keys for our proposed algorithm, 6 bits of prefix length information were attached after each prefix as long as the entire length did not exceed 32 bits.

TABLE 7 shows the result. The memory size for the Bloom filters was fixed at $16N'$ bits for $N$ prefixes. Since our proposed PMH-BF has a single Bloom filter, all $16N'$ bits were used for that Bloom filter. For the D-BF algorithm, the bits were allocated to the two Bloom filters in proportion to the number of prefixes in each. Multi-hashing tables were used for both algorithms. The memory amount for the hash table implementation was determined as follows. For the D-BF algorithm, we allocated 1 Mbyte for the direct lookup array. The number of hash table entries for the other prefixes was set to $2(N_{24} + N_{32})$, where $N_{24}$ is the number of prefixes with length 24 and $N_{32}$ is the number of prefixes with length 32.
The same amount of memory required for the implementation of the D-BF algorithm was allocated for the multi-hashing table of the PMH-BF. TABLE 7 shows that the D-BF algorithm has large prefix replication, in which the number of prefixes stored in the hash table is larger than the number of original prefixes. Hence the algorithm shows slightly worse search performance than our proposed algorithm, both in the average and in the worst case. In this simulation, it was assumed that collided prefixes are connected by a linked list, without assuming a perfect hash function for a given set of prefixes. Hence the worst-case number of hash table accesses is not bounded by 3 for [7], as shown for Grouptlcom and Telstra.

5.2 Search Performance Improvement by Adding a Bloom Filter to W-BSL

In implementing binary search on level algorithms, if there is no prefix in a level of the binary trie, it is an invalid level, and nodes in the invalid level are not stored in the Bloom filter or the hash table. Every node in the valid levels, including prefix nodes and internal nodes, is stored in both the Bloom filter and the hash table. Throughout this simulation, we assume a perfect hash function for storing nodes of the trie in the off-chip hash table. The worst-case number of off-chip memory accesses in the W-BSL algorithm is $\lceil \log_2 (W+1) \rceil$, which is 6.

TABLE 8 shows the performance comparison for the W-BSL algorithm. The number of nodes represents the total number of nodes stored. When there is no Bloom filter, the average number of off-chip memory accesses for an IP address lookup using the W-BSL algorithm is 4.33 or more, depending on the prefix set. The simulation results shown in TABLE 8 are for a Bloom filter of $8U'$ bits, where $U' = 2^{\lceil \log_2 U \rceil}$ and $U$ is the total number of nodes. The size of the Bloom filter is 16 to 64 Kbytes. The average number of Bloom filter queries is equal to the average number of memory accesses when there is no Bloom filter. For each given input, off-chip hash table accesses are avoided for every length for which the Bloom filter query is negative. A negative means that there is no node of that specific length for the input in the trie. The average number of off-chip memory accesses for an IP address lookup in the WBSL-BF became 2.50 to 3.19, and hence the number of memory accesses was reduced by around 40% by adding a Bloom filter.

5.3 Search Performance Improvement by Adding a Bloom Filter to L-BSL

TABLE 9 shows the performance evaluation result for the L-BSL algorithm.
As in the W-BSL case, if there is no prefix in a level of a leaf-pushed trie, the level is an invalid level, and nodes in invalid levels are not stored in the Bloom filter or the hash table. The number of nodes represents the total number of nodes, including prefix nodes and internal nodes, in valid levels. The number inside the parentheses represents the number of prefix nodes after leaf-pushing, which is slightly larger than the number of original prefix nodes. When there is no Bloom filter, the average number of off-chip memory accesses for an IP address lookup is 3.57 or more, depending on the prefix set. The search performance of L-BSL is slightly better than that of W-BSL, since a search can be terminated when a prefix node is encountered even if it is not at the last level of access. The simulation results are again for a Bloom filter of $8U'$ bits. The size of the Bloom filter is 16 to 128 Kbytes. The average number of off-chip memory accesses for an IP address lookup in the LBSL-BF became 2.62 to 3.47, and hence the number of memory accesses was reduced by approximately 30% by adding a Bloom filter. The performance improvement is smaller in the LBSL-BF than in the WBSL-BF. Because of leaf-pushing, many internal nodes are created, and hence fewer Bloom filter negatives are produced in the LBSL-BF.

5.4 Search Performance Comparison with Other Algorithms

This section shows simulation results comparing the proposed algorithms with other algorithms in terms of required memory amount and search performance. The algorithms in the comparison are binary trie (B-Trie) [4], priority-trie (P-Trie) [23], binary search on range (BSR) [24], binary search with prefix vector (BST-PV) [8], Waldvogel's BSL (W-BSL) [13], Lim's BSL (L-BSL) [14], the logW-Elevator algorithm (logW-E) [25], Dharmapurikar's algorithm (D-BF) [7], and the proposed algorithms (PMH-BF, WBSL-BF, LBSL-BF). The details of the B-Trie, P-Trie, BSR, BST-PV, and logW-E algorithms can be found in [4].
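The way a Bloom filter shortens binary search on levels, as evaluated in Sections 5.2 and 5.3, can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulated implementation: prefixes are bit strings, a Python dict stands in for the off-chip hash table with a perfect hash function, SHA-256 is a stand-in for the hash functions, each stored node carries a precomputed best matching prefix, and the names `BloomFilter`, `build_levels`, and `lookup` are hypothetical. Leaf-pushing (L-BSL) is not modeled.

```python
import hashlib


class BloomFilter:
    """Plain bit-array Bloom filter; SHA-256 is an assumed stand-in hash."""

    def __init__(self, m_bits: int, k: int):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _indices(self, key: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for idx in self._indices(key):
            self.bits[idx >> 3] |= 1 << (idx & 7)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[idx >> 3] & (1 << (idx & 7)) for idx in self._indices(key))


def build_levels(routes: dict, m_bits: int = 1 << 12, k: int = 4):
    """Store every trie node of every valid level in a dict and a Bloom filter."""
    valid_levels = sorted({len(p) for p in routes})  # levels holding a prefix
    nodes = {}
    for prefix in routes:
        for level in valid_levels:
            if level > len(prefix):
                break
            node = prefix[:level]
            # Best matching prefix at or above this node (None if there is none).
            nodes[node] = next(
                (routes[node[:i]] for i in range(level, -1, -1) if node[:i] in routes),
                None,
            )
    bf = BloomFilter(m_bits, k)
    for node in nodes:
        bf.add(node)
    return valid_levels, nodes, bf


def lookup(addr_bits: str, valid_levels, nodes, bf):
    """Binary search on valid levels; returns (next hop, hash table probes)."""
    lo, hi = 0, len(valid_levels) - 1
    best, probes = None, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        key = addr_bits[:valid_levels[mid]]
        if key in bf:                # positive: the hash table must be probed
            probes += 1
            if key in nodes:         # true positive: remember BMP, go longer
                if nodes[key] is not None:
                    best = nodes[key]
                lo = mid + 1
                continue
        hi = mid - 1                 # negative or false positive: go shorter
    return best, probes
```

For example, with routes `{"1": "A", "10": "B", "1011": "C", "0110": "D"}`, a lookup of `"10110000"` visits levels 2 and 4 and returns `"C"`. A lookup whose level-2 node is absent moves to level 1 without touching the hash table whenever the Bloom filter answers negative, which is the saving the section measures.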
Our simulation has used the same prefix sets as those used in [4]. TABLE 10 shows the required memory amount for each algorithm. For the algorithms requiring a Bloom filter, which are the D-BF, the PMH-BF, the WBSL-BF, and the LBSL-BF, the required memory amounts for the Bloom filters are also shown. The sizes of the Bloom filters are small enough that each Bloom filter can be embedded in a chip. The memory amount for the hash table implementation of the WBSL-BF and the LBSL-BF is the same as that of the W-BSL and the L-BSL, respectively. Algorithms requiring a multi-hashing table, such as the D-BF and the PMH-BF, generally consume more memory than the other algorithms, while they provide better search performance, as shown in Figures 10 and 11.

Figures 10 and 11 show the worst-case search performance and the average-case search performance, respectively. The D-BF algorithm and the proposed PMH-BF algorithm provide the best performance in both the worst case and the average case. The WBSL-BF and the LBSL-BF are next. The search performance of known algorithms was effectively improved by adding an on-chip Bloom filter.

TABLE 7. Search performance comparison with [7].
Columns: Routing Data; BF size (Kbyte); HT size (Mbyte); [7]: Prefix replication factor, $H_{max}$, $H_{avg}$; Proposed (PMH-BF): Prefix replication factor, $H_{max}$, $H_{avg}$.
Rows: MAE-WEST, MAE-EAST, PORT80, Grouptlcom, Telstra.

TABLE 8. Comparison in average search performance with and without Bloom filter for W-BSL.
Columns: Prefix Set; No. of Inputs; Trie characteristics: No. of Nodes, No. of Valid Levels; Without Bloom filter: Memory Access (Max., Avg.); With Bloom filter (WBSL-BF): BF Size (Kbyte), Memory Access (Max., Avg.).
Rows: MAE-WEST, MAE-EAST, PORT80, Grouptlcom, Telstra.

TABLE 9. Comparison in average search performance with and without Bloom filter for L-BSL.
Columns: Prefix Set; No. of Inputs; Trie characteristics: No. of Nodes (prefix nodes), No. of Valid Levels; Without Bloom filter: Memory Access (Max., Avg.); With Bloom filter (LBSL-BF): BF Size (Kbyte), Memory Access (Max., Avg.).
Rows, with the number of prefix nodes after leaf-pushing in parentheses: MAE-WEST (19968), MAE-EAST (59377), PORT80 (145267), Grouptlcom (203093), Telstra (285741).

TABLE 10. Memory requirement (Mbyte).
Columns: Prefix Set; B-Trie; P-Trie; BSR; BST-PV; logW-E; W-BSL; L-BSL; D-BF (BF*, HT); PMH-BF (BF*, HT); WBSL-BF (BF*, HT); LBSL-BF (BF*, HT).
Rows: Mae-West, Mae-East, PORT80, Grouptlcom, Telstra.
*Each Bloom filter size is in Kbyte.

6 CONCLUSION

This paper shows how effectively an on-chip Bloom filter can improve the search performance of known efficient IP address lookup algorithms. The parallel multiple-hashing architecture provides high-speed IP address lookup with a single access cycle of off-chip memory, but it requires complicated hardware for parallel accesses to the separate memories storing the prefixes of each length. This paper shows how to avoid the parallel access to off-chip memories by adding a small on-chip Bloom filter. For a given input, the Bloom filter is queried first, starting from the longest length. If the result is a negative, access to the off-chip hash table is avoided for that specific length. The off-chip hash table is accessed only for a positive result of the Bloom filter. When it turns out to be a true positive, the search for the input is finished. It is shown that the proposed architecture provides average search performance comparable to that of the parallel multiple-hashing architecture by properly controlling the false positive rate. The proposed architecture requires much less hardware, since it has only a small on-chip Bloom filter and a single multi-hashing table and does not require complicated hardware or separate memories for parallel access.
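The lookup procedure summarized above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the evaluated design: prefixes are bit strings, a Python dict stands in for the off-chip multi-hashing table, SHA-256 replaces the rand()-based hash indices mentioned earlier, and the names `BloomFilter`, `build`, and `lookup` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain bit-array Bloom filter; SHA-256 is an assumed stand-in hash."""

    def __init__(self, m_bits: int, k: int):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _indices(self, key: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for idx in self._indices(key):
            self.bits[idx >> 3] |= 1 << (idx & 7)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[idx >> 3] & (1 << (idx & 7)) for idx in self._indices(key))


def build(routes: dict, m_bits: int = 1 << 12, k: int = 4):
    """Program one Bloom filter and one table with every prefix."""
    bf = BloomFilter(m_bits, k)
    for prefix in routes:
        bf.add(prefix)
    # Distinct prefix lengths, longest first, give the query order.
    lengths = sorted({len(p) for p in routes}, reverse=True)
    return bf, dict(routes), lengths


def lookup(addr_bits: str, bf, table, lengths):
    """Return (next hop, number of hash table probes) for an address."""
    probes = 0
    for length in lengths:
        candidate = addr_bits[:length]
        if candidate not in bf:      # negative: skip the off-chip access
            continue
        probes += 1                  # positive: one hash table access
        if candidate in table:       # true positive: longest match found
            return table[candidate], probes
        # false positive: fall through to the next shorter length
    return None, probes
```

With routes `{"1": "A", "10": "B", "1011": "C", "0110": "D"}`, looking up `"10110000"` finds `"C"` with a single table probe, because the longest length is a true positive. Inputs whose longer candidates are Bloom filter negatives reach their match without any wasted probe unless a false positive occurs.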
Among trie-based algorithms, algorithms based on binary search on trie levels provide the best search performance, since their search time is $O(\log l_{dist})$, where $l_{dist}$ is the number of distinct prefix lengths. This paper shows how to further improve the search performance of those algorithms by adding a simple on-chip Bloom filter. For each given input, the Bloom filter is queried first for the current level of access. If the result is a negative, there is no node for the input at that level of the trie. Hence the search can proceed to a shorter level without accessing the off-chip hash table. It is shown that the average search performance is improved by 30-40% by avoiding the access to the off-chip hash table when there is no node in the current level. Multi-bit tries with controlled prefix expansion [22], [33]-[34] provide better search performance than binary tries by reducing the number of distinct levels. Binary search on trie levels can be applied to multi-bit tries without loss of generality, and hence

Fig. 10. Worst-case number of memory accesses for each algorithm. (a) Mae-West (14553 prefixes). (b) Mae-East (39464 prefixes). (c) PORT80 (112310 prefixes). (d) Grouptlcom (170601 prefixes). (e) Telstra (227223 prefixes).

Fig. 11. Average number of memory accesses for each algorithm. (a) Mae-West (14553 prefixes). (b) Mae-East (39464 prefixes). (c) PORT80 (112310 prefixes). (d) Grouptlcom (170601 prefixes). (e) Telstra (227223 prefixes).

our proposed approach using an on-chip Bloom filter can also be applied to binary search on trie levels in a multi-bit trie. We believe that a Bloom filter is a simple but extremely powerful data structure that will improve the performance of many other applications as well [31], and we are actively seeking possible applications.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) ( ). This research was also supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC support program supervised by the NIPA (NIPA-2012-H ). The preparation of this paper would not have been possible without the efforts of our students in the SoC Design Lab at Ewha W. University on simulations. We are particularly grateful to Jungwon Lee and Youngju Choi.

REFERENCES

[1] H. Jonathan Chao, "Next generation routers," Proceedings of the IEEE, vol. 90, no. 9, Sept.
[2] M. A. Ruiz-Sanchez, E. M. Biersack, and W. Dabbous, "Survey and taxonomy of IP address lookup algorithms," IEEE Networks, vol. 15, no. 2, pp. 8-23, March/April.
[3] S. Sahni, K. Kim, and H. Lu, "Data structures for one-dimensional packet classification using most-specific-rule matching," International Journal on Foundations of Computer Science, vol. 14, no. 3.
[4] H. Lim and N. Lee, "Survey and proposal on binary search algorithms for longest prefix match," IEEE Communications Surveys and Tutorials, pp. 1-17, 2012 (IEEE early access).
[5] F. Yu, R. H. Katz, and T. V. Lakshman, "Efficient multimatch packet classification and lookup with TCAM," IEEE Micro, vol. 25, no. 1, Jan/Feb.
[6] H. Lu and S. Sahni, "Dynamic tree bitmap for IP lookup and update," International Conference on Networking.
[7] S. Dharmapurikar, P. Krishnamurthy, and D. Taylor, "Longest prefix matching using Bloom filters," IEEE/ACM Trans. Networking, vol. 14, no. 2, Feb.
[8] H. Lim, H. Kim, and C. Yim, "IP address lookup for Internet routers using balanced binary search with prefix vector," IEEE Trans. on Communications, vol. 57, no. 3, Mar.
[9] A. Broder and M. Mitzenmacher, "Using multiple hash functions to improve IP lookups," IEEE Infocom, vol. 3.
[10] H. Lim and Y. J. Jung, "A parallel multiple hashing architecture for IP address lookup," IEEE HPSR, pp. 91-95.
[11] H. Lim, J. Seo, and Y. Jung, "High speed IP address lookup architecture using hashing," IEEE Communications Letters, vol. 7, no. 10, Oct.
[12] W. Eatherton, G. Varghese, and Z. Dittia, "Tree bitmap: Hardware/software IP lookups with incremental updates," ACM SIGCOMM Computer Communications Review, vol. 34, no. 2, Apr.
[13] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner, "Scalable high speed IP routing lookups," Proc. ACM SIGCOMM, 1997.
[14] J. H. Mun, H. Lim, and C. Yim, "Binary search on prefix lengths for IP address lookup," IEEE Communications Letters, vol. 10, no. 6, June.
[15] K. Kim and S. Sahni, "IP lookup by binary search on prefix length," Journal of Interconnection Networks, vol. 3.
[16] W. Lu and S. Sahni, "Succinct representation of static packet classifiers," IEEE/ACM Transactions on Networking, vol. 17, no. 3.
[17] W. Lu and S. Sahni, "Recursively partitioned static router tables," IEEE Transactions on Computers, vol. 59, no. 12.
[18] S. Sahni and K. Kim, "An O(log n) dynamic router table design," IEEE Trans. on Computers, vol. 53, no. 3, Mar.
[19] H. Lu and S. Sahni, "O(log n) dynamic router-tables for prefixes and ranges," IEEE Trans. on Computers, vol. 53, no. 10, Oct.
[20] W. Lu and S. Sahni, "Packet classification using space-efficient pipelined multi-bit tries," IEEE Trans. on Computers, vol. 57, no. 5, May.
[21] K. Kim and S. Sahni, "Efficient construction of pipelined multibit-trie router-tables," IEEE Trans. on Computers, vol. 56, no. 1, pp. 32-43, Jan.
[22] V. Srinivasan and G. Varghese, "Fast address lookups using controlled prefix expansion," ACM Transactions on Computer Systems, vol. 17, no. 1, pp. 1-40, Feb.
[23] H. Lim, C. Yim, and E. E. Swartzlander, Jr., "Priority trie for IP address lookup," IEEE Trans. on Computers, vol. 59, no. 6, Jun.
[24] B. Lampson, B. Srinivasan, and G. Varghese, "IP lookups using multiway and multicolumn search," IEEE/ACM Trans. Networking, vol. 7, no. 3.
[25] R. Sangireddy, N. Futamura, S. Aluru, and A. K. Somani, "Scalable, memory efficient, high-speed algorithms for IP lookups," IEEE/ACM Trans. on Networking, vol. 13, no. 4, Aug.
[26] K. Lim, K. Park, and H. Lim, "Binary search on levels using a Bloom filter for IPv6 address lookup," IEEE/ACM ANCS, 2009.
[27] P. Panda, N. Dutt, and A. Nicolau, "On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems," ACM Transactions on Design Automation of Electronics Systems, vol. 5, no. 3, July.
[28] H. Lim and S. Kim, "Tuple pruning using Bloom filters for packet classification," IEEE Micro, vol. 30, no. 3, May/June.
[29] A. G. Alagu Priya and H. Lim, "Hierarchical packet classification using a Bloom filter and rule-priority tries," Computer Communications, vol. 33, no. 10, Jun.
[30] H. Song, S. Dharmapurikar, J. Turner, and J. Lockwood, "Fast hash table lookup using extended Bloom filter: An aid to network processing," Proc. ACM SIGCOMM, Aug.
[31] S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz, "Theory and practice of Bloom filters for distributed systems," IEEE Communications Surveys and Tutorials, vol. 14, no. 1, first quarter.
[32] J. Hasan, S. Cadambi, V. Jakkula, and S. Chakradhar, "Chisel: A storage-efficient, collision-free hash-based network processing architecture," Proc. ISCA.
[33] W. Lu and S. Sahni, "Packet forwarding using pipelined multibit tries," IEEE Symposium on Computers and Communications, May.
[34] W. Lu and S. Sahni, "Packet classification using pipelined two-dimensional multibit tries," IEEE Symposium on Computers and Communications, May.

Hyesook Lim (M'91) received a B.S. degree and an M.S. degree from the Department of Control and Instrumentation Engineering at Seoul National University, Seoul, Korea, in 1986 and 1991, respectively. She received the Ph.D. degree from the University of Texas at Austin. From 1996 to 2000, she was employed as a member of the technical staff at Bell Labs of Lucent Technologies, Murray Hill, New Jersey. From 2000 to 2002, she worked for Cisco Systems, San Jose, California. She is currently a professor in the Department of Electronics Engineering, Ewha Womans University, Seoul, Korea. Her research interests include router design issues such as address lookup and packet classification, Bloom filter application to various distributed algorithms, and the hardware implementation of various network algorithms.

Kyuhee Lim received a B.S. degree from the Department of Electronics Engineering at Ewha Womans University, Seoul, Korea, in 2005. From 2005 to 2009, she was employed at Hynix Semiconductor, Korea, where she worked on memory design. She is currently pursuing a Ph.D. degree at the same university. Her research interests include address lookup and packet classification algorithms and TCAM architecture design.

Nara Lee received a B.S. degree and an M.S. degree from the Department of Electronics Engineering at Ewha Womans University, Seoul, Korea, in 2009 and 2012, respectively. Her research interests include various network algorithms such as IP address lookup and packet classification, web caching, and Bloom filter application to various distributed algorithms.

Kyong-hye Park received a B.S. degree and an M.S. degree from the Department of Electronics Engineering at Ewha Womans University, Seoul, Korea, in 2007 and 2009, respectively. She works for the Mobile Communication Business Unit, Samsung Electronics, Korea, where she is currently developing Android handsets.
