An Efficient Parallel TCAM Scheme for the Forwarding Engine of the Next-generation Router

Bin Zhang, Jiahai Yang, Jianping Wu, Qi Li, Donghong Qin
Network Research Center, Tsinghua University
Tsinghua National Laboratory for Information Science and Technology (TNList)
Beijing, China

Abstract: Ternary Content-Addressable Memory (TCAM) is a popular hardware device for fast IP address lookup. The high link transmission speed of the Internet backbone demands a more powerful IP address lookup engine. Restricted by memory access speed, the lookup engine for next-generation routers must exploit parallelism among multiple TCAM chips. However, most existing schemes improve lookup performance and reduce power consumption but ignore update efficiency. In this paper, we propose a crossed address range division and shared caching scheme. We improve update efficiency significantly with a buddy update method while keeping power dissipation low by decreasing the number of TCAM accesses triggered in each lookup operation. The lookup throughput is kept very high through adaptive load balancing. Our simulation results show that the proposed scheme can achieve an average lookup speedup factor greater than 11 with 12 TCAM chips, at the cost of 10% more memory space and an additional cache chip.

I. INTRODUCTION

Fast lookup is a key technique for the development of high performance routers [1]. Existing schemes can be classified into Trie-based software approaches and TCAM-based hardware approaches. Although many improved methods have been proposed to optimize the original Trie structures [2], [3], Trie-based IP lookup schemes still require several memory accesses per lookup, and those accesses may be serialized. It is hard to further improve lookup speed in Trie-based approaches because of intrinsic characteristics such as the access speed of DRAM/SRAM.
In contrast, TCAM can perform a lookup operation in a single cycle owing to its parallel access characteristics. Moreover, updating the forwarding table in TCAM-based schemes is generally simpler than in trie-based schemes. Therefore, TCAM-based lookup has received much attention recently. However, challenges in TCAM-based lookup arise for the following reasons: a) an IP address may match multiple prefixes in the forwarding table, and the longest matching prefix (LMP) must be chosen; b) an 18MB TCAM can only operate at a speed of up to 266MHz and perform 133 million lookups per second [4], which is barely enough to keep up with today's 40Gbps line rate; c) the size of the routing table has been increasing at a rate of about 10-50k entries per year in the past few years [5], and more storage will be needed when IPv6 is widely deployed; d) the prefixes in the routing table change frequently because of the instability of the Internet, and frequent updates may consume many computation cycles in the IP lookup engine and degrade the whole system; e) high power consumption is also traditionally one of the major problems in building a lookup engine. Hence there are three major issues to consider when building a lookup engine: a) lookup performance; b) update time; c) power consumption. Existing parallel schemes [6], [7], [8], [9], [10] were put forward to partially improve system efficiency: [6], [7] focus only on the update time; [8], [9], [10] focus on both lookup throughput and power consumption. It seems hard to deal with all three issues simultaneously. Motivated by this situation, in this paper we propose a crossed address range division and shared caching scheme to improve the lookup performance, and use a buddy update method to improve the update efficiency. Besides, the scheme keeps power consumption low by decreasing the number of TCAM accesses triggered in each lookup operation.
The main ideas of our scheme are as follows: 1) calculate the number of TCAM chips b needed on the basis of the number of forwarding table entries, the chip size and the percentage of free space on each chip in the initial state; 2) partition the forwarding table IP prefixes into b roughly equal parts based on address range, with every part overlapping its neighbors to balance traffic among them; 3) update prefixes efficiently by organizing the prefixes well and using the buddy update method; 4) add a cache TCAM chip to balance the whole traffic; the proposed cache mechanism is based on statistical features of the traffic. Our scheme is easy to implement, scalable and flexible. The rest of the paper is organized as follows. Related work on parallel TCAM schemes is described in section 2. In section 3 we propose our parallel scheme and describe the functionality of each component. We present our theoretical performance analysis in section 4 and simulation results in section 5. Finally, we conclude our work in section 6.

II. RELATED WORKS

Chip-level parallel TCAMs are deployed to circumvent the limitation of a single TCAM, but some problems remain: 1) traffic load balance; 2) update time; 3) power dissipation.

Fig. 1. Schematics of the complete implementation architecture

Much research has been conducted to optimize the performance of TCAM-based lookup engines. In [6], the forwarding table is divided by output port into several smaller tables; the number of tables equals the number of output ports on the router, and each table holds all the entries that map to its output port. Since all entries in a partitioned table map to the same output port, there is no longer a need to keep the entries sorted. When a search occurs, each TCAM looks up the IP address in parallel, and each table outputs the matched lengths to a selection logic. After the selection logic chooses the longest length, the packet is forwarded to the output port based on which table had the longest prefix match. The basic idea of [7] is to partition all prefixes into different sets based on the relationship among them. For a given destination IP address, the prefixes that have an ancestor-descendant relation will be matched simultaneously. The depth of a prefix search trie currently does not exceed 7, even including the default prefix, so there can be at most 7 matches. If the forwarding table is partitioned into several TCAMs so that there is no ancestor-descendant relation within each partitioned TCAM, then it is guaranteed that there is at most one match in each TCAM. [6], [7] focus only on the update time: the update cost of both schemes is O(1), but the lookup performance is poor and the power consumption is high. Panigrahy et al. [8] partition all prefixes equally into 8 parts based on the address range and put each part in a TCAM chip. When a search occurs, each TCAM looks up the IP address in parallel based on its address range, and the packet is forwarded to the output port based on which table had the longest prefix match. Hence the lookup performance can be improved.
Power consumption can be saved by decreasing the number of TCAM accesses triggered in each lookup operation, but the improvement in lookup performance is very limited because there is no further balancing mechanism. To further improve the lookup performance, Zheng et al. [9] proposed a parallel lookup architecture in which each TCAM chip has a redundant partition to balance the traffic load. It was assumed that the lookup traffic distribution among IP prefixes can be derived from long-time statistics of traffic traces. However, Internet traffic can be very bursty. Mapping routing table partitions onto chips based on long-term traffic distribution cannot effectively balance the workload of individual chips during bursty periods. When the traffic is temporarily biased toward a limited number of routing prefixes, the multiple selectors will frequently access the same block, which drastically hampers the whole throughput. To address this issue, Lin et al. [10] proposed a logical cache scheme. Instead of mapping routing table partitions onto TCAM chips relying on long-term traffic statistics, they reserve a partition on each TCAM chip as a logical cache for dynamic load balancing, which can balance the traffic load adaptively. Both schemes [9], [10] neglect an important factor, the update time, when they try to improve the lookup and power efficiency. The inner-chip partition mechanism of [9], [10] forces the whole system to be reconstructed when one partition on any of the parallel TCAM chips becomes full during update. Frequent reconstruction can decrease the whole system's efficiency dramatically; the inefficient update of TCAM is the performance bottleneck of the whole system, and it is very hard to directly improve the update efficiency on top of the parallel schemes in [9], [10]. How to redesign the whole parallel scheme to keep high throughput and low power consumption while at the same time achieving fast updates is the goal of our scheme.

III. PROPOSED CHIP-LEVEL PARALLEL SCHEME

The detailed implementation architecture of the parallel lookup engine is presented in Fig. 1. The basic ideas of our scheme are: 1) the number of entries on each TCAM chip is approximately equal and well organized for fast update and least reconstruction; 2) the distribution of prefixes on a TCAM chip is based on address range, and the address range of a TCAM chip overlaps with its neighbors to balance their traffic; 3) a cache TCAM chip is reserved to hold popular destination IP addresses to balance all traffic. When an incoming packet arrives, the destination IP address is extracted and delivered to the indexing logic for comparison. The indexing logic returns a partition number indicating which TCAM (or two neighboring TCAMs) may contain the matching prefixes and tags the packet with it. Before being sent to the adaptive load balance logic, the packet is tagged with a time stamp by the time stamp logic, because incoming packets can be processed in a non-FIFO order due to different queuing lengths and cache misses. The packet is finally sent to the adaptive load balance logic, which decides which TCAM's input queue the packet should be sent to. The adaptive load balance logic selects the idlest among the matching TCAMs and the cache TCAM. There are three rules: 1) if the queuing length of the matching TCAM equals that of the cache TCAM, the matching TCAM (the so-called home TCAM) has priority; 2) if the queuing lengths of two matching TCAMs are equal, either one is selected randomly; 3) otherwise, the idlest one is selected. After a packet is served by a TCAM, the result is sent to the re-ordering logic, which maintains the original sequence using the attached time stamp. After re-ordering, the final result is output. Notice that when a cache miss occurs, the packet should be sent back to the adaptive load balance logic instead of directly to the matching TCAM as suggested in [10], because the packet may hit the cache in the next cycle. Our scheme tries to balance the traffic distribution across the parallel TCAM chips and improve their update efficiency, so the following three key issues must be considered: 1) how to partition all prefixes so that they are distributed equally among the TCAM chips; 2) how to organize the entries on each TCAM to improve update efficiency and how to update the TCAM chip; 3) how to balance the traffic on all TCAM chips.

A. Partition Method

The first thing in this approach is the partition rule.
A good partition rule must meet two primary conditions: 1) it must create roughly equal partitions; 2) it must be simple, so that the indexing logic is simple and fast. In other words, it should be possible to locate the target partition for a given search key in one cycle using simple hardware. Generally, three kinds of methods for partitioning the entire routing table have been proposed, i.e., key-id based [9], prefix Trie-based [11] and range-based [8], [10] partitioning methods. The key-id approach suffers from uneven sub-table sizes and uncontrolled redundancy, which results in higher memory and power cost. Trie-based partitioning can lower the redundancy and unify sub-table sizes, but it requires an extra index TCAM to perform the block selection; as a result, two TCAM accesses are needed for each lookup request. Range-based partitioning can lower the redundancy and unify sub-table sizes most effectively by splitting the routing table into multiple buckets of identical size according to the address range; it was introduced by Panigrahy et al. [8] and implemented by Lin et al. using the pre-order splitting algorithm [10]. Thus, range-based partitioning can satisfy condition 1).

Fig. 2. Schematics of the Indexing Logic
Fig. 3. Dividing Prefixes into 4 Equal Size Partitions (B0-B4 mark the partition boundaries; a prefix may fall into a single partition or across multiple partitions)

TABLE I. PARAMETERS IN OUR SCHEME
x: TCAM number
s: TCAM chip capability (overall number of prefixes it can hold)
n: TCAM chip initial occupancy percentage
w: Whole number of prefix entries
q: Percentage of the overlapping area
k: Number of entries on each chip (excluding the overlapping area)

Fig. 2 illustrates the structure of the indexing logic for TCAM chip selection. It includes pairs of parallel comparing logics and an index table. Each pair of parallel comparing logic corresponds to one TCAM chip and is composed of two registers that store the boundary address range of that TCAM partition.
Next to the parallel comparing logics is an index table with an encoder. The index table stores the partition distribution information and, via the encoded information, returns a partition number indicating which chip or two neighboring chips may contain the prefix matching the IP address. Because the data width of the indexing logic is fixed and only a simple compare operation is executed, it can work at very high speed. Hence range-based partitioning can also meet condition 2). Based on this analysis, we use the range-based partitioning method to divide the routing table. Before partitioning we must determine the number of partitions. Only one bucket/partition is put into a TCAM chip in our scheme, which means the number of buckets/partitions equals the number of TCAM chips (excluding the cache TCAM chip). This achieves four advantages: 1) it improves the update efficiency by reducing the number of reconstructions; 2) it reduces the TCAM cost by not needing support for the partition-disable technique; 3) it saves memory by reducing the number of repeated entries in each TCAM chip; 4) it reduces the number of registers in the indexing logic. As Fig. 3 illustrates, there are at most 32 × 2 = 64 boundary
prefixes for IPv4 that need to be present in more than one TCAM chip (these prefixes fall into multiple ranges), because any prefix that falls into multiple ranges must be on the path from one of these boundaries to the root; otherwise, it strictly lies in the interior of one of the regions carved out by these paths. In our scheme, the overlapping entries between two neighbors are far more than 64, so we ignore the 64 boundary prefixes in the following calculation. The parameters of our scheme are listed in Table I. Since the first and last TCAM chips each have only one overlapping area of q/2, we get:

w = x * s * (n - q/2) + s * q/2    (1)

When designing the lookup engine, we know the TCAM chip capability s and the number of prefix entries w, and choose the TCAM initial occupancy percentage n and the percentage of the overlapping area q based on demands; then we can calculate the chip number x by (1). For example, if s = 40K × 36b, n = 50%, w = 200K and q = 10%, we get x = 11; in addition, there is one chip for the cache with the same size, hence we get 12 chips altogether. After that we calculate the number of entries k in the non-overlapping part of each chip by (2):

k = (w + s * q/2) / x - s * q/2    (2)

In the example above, we get k ≈ 16756; in addition, there are 2048 (= 10% × 40K / 2, with 40K = 40960) entries in each overlapping area. Summing over the 11 TCAMs, 11 × 16756 + 10 × 2048 = 204796, so there are 4 entries left compared with the total number of entries w (w = 200K = 204800). We put the residual 4 entries in the last partition. The initial occupancy percentages of the first and last chips are about 45.9%; the initial occupancy percentage of the other TCAM chips is about 50.9%. We get a scalable and flexible scheme, since we can derive the number of parallel TCAM chips from the total number of prefixes and the TCAM chip capability, and can make the number of entries on every TCAM roughly equal by (1) and (2). We fill the TCAM chips with the pre-order splitting method originally proposed in [10].
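As a sanity check on (1) and (2), the sizing calculation can be sketched in a few lines (a sketch only; the function name and the rounding choices are ours, and 40K is taken as 40960 entries, which reproduces the 2048-entry overlap areas in the example):

```python
import math

def plan_chips(s, n, w, q):
    """Size the parallel TCAM array from Table I's parameters.

    s: chip capacity (entries), n: initial occupancy, w: total prefixes,
    q: overlap percentage. Returns (chip count x, non-overlap entries k,
    entries per overlap area)."""
    # Eq. (1) solved for x:  w = x*s*(n - q/2) + s*q/2
    x = math.ceil((w - s * q / 2) / (s * (n - q / 2)))
    # Eq. (2): entries in the non-overlapping part of each chip
    k = (w + s * q / 2) / x - s * q / 2
    overlap = s * q / 2  # entries in each overlapping area
    return x, math.floor(k), overlap

x, k, overlap = plan_chips(s=40 * 1024, n=0.50, w=200 * 1024, q=0.10)
print(x, k, overlap)  # 11 16756 2048.0
```

With these inputs the division in (1) happens to be exact (x = 11); the `ceil` covers the general case where the last chip would otherwise be over-filled.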
The main function of the pre-order splitting method is to partition the prefixes into 11 roughly equal parts and record the address range of each part.

B. Entries Organization and Update Mechanism

Fig. 4. Overlapping range-based TCAM architecture

As shown in Fig. 4, we put the buckets obtained from the pre-order splitting method into the TCAMs. The overlapping part is put at the top of one TCAM and at the bottom of the next TCAM, and the non-overlapping part is put in the middle of the TCAM. The sizes of the two empty spaces are made equal for efficient update. Bucket1 corresponds to TCAM1, ..., and bucket11 corresponds to TCAM11. Each partition overlaps with its neighbors, so we need to save two references, the high and the low. The high reference indicates the high boundary of the non-overlapping part (which is also the low boundary of the overlapping area shared with the next chip); the low reference indicates the low boundary of the non-overlapping part (which is also the high boundary of the overlapping area shared with the previous chip). We can use these references to divide each partition into the overlapping part and the non-overlapping part, and put them into the positions illustrated above. From Fig. 4 we may draw several properties that benefit efficient update: 1) entries in the TCAM are ordered by the length of the IP prefix; 2) a small part of the address range overlaps with the neighbors; 3) the positions of the different parts of the address range are well organized. We propose a buddy update method (Fig. 5) for TCAM update based on this entries organization. It works as follows. The first case of prefix update is that the update happens in a non-full TCAM. The prefix P is compared with the boundary points to find the corresponding partition. If P falls in a non-overlapping partition, it is written to the corresponding position on insertion and simply removed on deletion. We must keep the IP prefixes ordered during update, so we need to adjust entry positions when inserting or deleting an entry.
If the number of entries above the updated prefix P is greater than the number below, we move the entries below; otherwise, we move those above. For a prefix in an overlapping area, the update operation must be done in both of the two neighboring TCAMs. During update, when one of the two free spaces on a TCAM chip is exhausted, we need to re-adjust
the entries to make the sizes of the two free spaces equal again. The update cost in this case is easy to bound: if the total number of entries is n and the number of TCAMs is x, the update cost is O(n/2x). The second case of prefix update is that the update happens in a full TCAM chip. The full chip needs to get free space from one of its neighbors: it inspects the free space of all its neighbors to find the larger one, which we call its buddy. It is not difficult to balance the sizes of two adjacent TCAM chips dynamically using the pre-order algorithm; new reference nodes and new address ranges of the two buddies are recomputed afterwards, and the two buddies end up with equal free space. A threshold on the free space size is set, and buddy adjustment is done only when the buddy's free space is greater than the threshold. The reason we set a threshold is that we should not perform buddy adjustment when all the neighbors are nearly full, which would be too inefficient. The system has to be reconstructed when the free space of all neighbors is less than the threshold. Since we take care of future updates when designing the chips' initial occupancy percentage, reconstruction is very unlikely to happen; normally, in this situation, one should consider whether the hardware needs to be upgraded to accommodate the growing number of prefixes. The update cost is O(2n/x) in this case.

C. Traffic Balance

There are two mechanisms to balance traffic: the redundancy mechanism and the cache mechanism. The chip-level TCAM redundancy mechanism derives originally from [9]. Different from [9], the redundancy mechanism in our scheme overlaps the address ranges of neighbors. The redundancy mechanism may work well if traffic is evenly distributed or slightly uneven, but system performance will decline sharply under bursty traffic.
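The interplay between the overlapping ranges and the arbitration rules of Section III can be illustrated with a small sketch (our own simplification; the function name and data layout are hypothetical, real hardware compares boundary registers in parallel, and ties between two home TCAMs are broken randomly rather than deterministically as here):

```python
def select_chip(dip, ranges, queues, cache_queue):
    """Pick the chip to serve a destination IP address (DIP).

    ranges[i] = (low, high) address range of chip i; neighbors overlap,
    so a DIP in an overlap area matches two home TCAMs. queues[i] is
    chip i's input-queue length; cache_queue is the cache TCAM's."""
    matches = [i for i, (lo, hi) in enumerate(ranges) if lo <= dip <= hi]
    # Rule 2: among matching home TCAMs, take the idlest (ties: first).
    best = min(matches, key=lambda i: queues[i])
    # Rule 1: on a tie with the cache, the home TCAM wins, so the cache
    # is chosen only when it is strictly idler.
    return "cache" if cache_queue < queues[best] else best

# A DIP of 95 falls in the overlap of chips 0 and 1: chip 1 is idler.
print(select_chip(95, [(0, 99), (90, 199)], [4, 2], 9))  # 1
```

A DIP in an overlap area thus always has two candidate home chips plus the cache, which is what lets neighboring chips absorb each other's bursts.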
It is natural to use a cache scheme to further improve the performance. A logical cache structure was put forward in [10], but a logical cache update stops the lookup operation of the chip, which dramatically decreases lookup efficiency due to frequent cache updates. In our scheme we propose a shared cache TCAM to prevent this. Following the traditional cache schemes [12], [13], [14], a straightforward design would deploy a first-stage cache working in front of second-stage data TCAMs. An obvious drawback of this conventional approach is that two TCAM accesses are needed for each lookup request; hence the cache is required to operate at N times the speed of the TCAMs if there are N parallel TCAMs in the system, which is impractical [10]. Different from the traditional cache schemes, our shared cache works in parallel with the other TCAM chips, which means only one TCAM access is needed for each lookup request. The cache TCAM chip does not hinder the operation of the other TCAM chips, so the cache chip can have the same speed as the other TCAM chips, which greatly saves hardware cost.
There are two kinds of caching algorithms proposed for route lookup: caching destination IP addresses (DIPs) [12], [13], [14] and caching prefixes [15]. We use the DIP-caching method, since caching prefixes must take the LMP into account. Although Mohammad et al. [15] put forward the RRC-ME algorithm to tackle the LMP, immediate cache update in the event of a cache miss is very difficult, since the system has to evaluate the minimal expansion prefix decomposition of the corresponding prefix.

Fig. 5. Buddy Update Method

Buddy_update(P, op) {
    /* Compare prefix P with all boundary references to detect the home TCAMs that need updating */
    for (i = 0; i <= max(reference_index); i++) {
        if (P >= lReference_node[i](lowRange)) and (P <= hReference_node[i](highRange)) then {
            // P is in the non-overlapping part of TCAM[i]
            Tcam_update(P, i, op);
        }
        if (P >= hReference_node[i](highRange) + 1) and (P <= lReference_node[i+1](lowRange) - 1) then {
            // P is in the overlapping part of TCAM[i] and TCAM[i+1]: update both
            Tcam_update(P, i, op);
            Tcam_update(P, i+1, op);
        }
    }
}

Tcam_update(P, id, op) {
    Search P and find its position;
    if (op = "insert") and (P exists) then {
        cover(P);    // if P already exists, just overwrite the old entry
        exit;
    }
    if (tcam[id] is not full) then {
        if (upper_space(size) = 0) or (lower_space(size) = 0) then
            move entries to make upper_space(size) = lower_space(size);
        if (upper_entries(P) > lower_entries(P)) then {
            // fewer entries below P: shift the lower part
            if (op = "insert") then { move lower entries downwards; Insert(P); }
            else if (op = "delete") then { Delete(P); move lower entries upwards; }
        } else {
            // fewer entries above P: shift the upper part
            if (op = "insert") then { move upper entries upwards; Insert(P); }
            else if (op = "delete") then { Delete(P); move upper entries downwards; }
        }
    } else {    // TCAM is full
        if (tcam[id-1](free_size) > tcam[id+1](free_size)) then buddy = id-1;
        else buddy = id+1;
        if (tcam[buddy](free_size) > threshold) then {
            Preorder_split(2);    // re-split the prefixes of the two buddies
            Adjust the two buddies;
        } else {
            Preorder_split(b);    // b is the number of all partitions
            Reconstruct the whole system;
        }
    }
}
Caching DIPs can also save hardware cost, since the cache TCAM chip can be replaced by a cheaper binary CAM chip if only DIPs are cached. Another reason we cache DIPs is the high degree of locality of DIPs [14]. Based on the adaptive load balance logic rules in our scheme, the unpopular DIPs are more likely to be sent to their home TCAM chips, and the popular DIPs tend to be sent to the cache TCAM. If the distribution of traffic DIPs is heavy-tailed, meaning that a large number of DIPs occur only a few times while a few DIPs occur many times, our arbitration mechanism obviously benefits from it. We need to study the distribution of traffic DIPs for our cache
mechanism. The traces we used were collected from the core router of the China Education and Research Network (CERNET) located at Tsinghua University from 2007/07/07 to 2007/07/17. CERNET is a national academic computer network and one of the biggest networks in China, with over 18 million users. Over 1300 schools, research and government institutions in 31 provinces connect to CERNET via 8 regional network centers. The traces were collected by a special hardware device and contain only IP headers to save space. Through long-time statistics of the DIP distribution in different traces, we found similar distributions at different times and on different time scales (below half an hour; we did not test longer time scales). We plot the frequency versus the number of occurrences of DIPs on a log-log scale in Fig. 6. Fig. 6 a)-d) present the DIP distributions of traces of increasing length, from 10000 continuous packets upward. From Fig. 6 we notice that the DIP distribution approximately follows a power law and that the tail is much heavier than the power law on different time scales, which implies not only that a high degree of temporal locality exists, but also that the temporal locality can last a long time and be very steady. We believe that DIP lookup speed can be increased by taking advantage of this steady temporal locality: if an address is referenced, it will likely be used again soon and for a period of time. From Fig. 6 d) we can see that more than 10^5 DIPs occur only once while 1 DIP occurs 10^5 times. We propose the cache TCAM to capture commonly referenced addresses for fast lookup. Our statistics show that we can achieve an average hit rate over 98% with the immediate-update TCAM cache mechanism and the LRU algorithm. One reason we propose immediate update is the temporal locality; another is that, in our scheme, the packet is sent back to the adaptive load balance logic on a cache miss.
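The immediate-update DIP cache with LRU replacement can be sketched as follows (a software model only; the class name and interface are ours, and an `OrderedDict` stands in for the cache TCAM/BCAM hardware):

```python
from collections import OrderedDict

class DipCache:
    """LRU cache of destination IP addresses, updated immediately on a miss."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # DIP -> next hop, oldest first

    def lookup(self, dip):
        if dip in self.entries:
            self.entries.move_to_end(dip)  # refresh LRU position on a hit
            return self.entries[dip]
        return None  # miss: the packet goes back to the balance logic

    def update(self, dip, next_hop):
        # Immediate update, so the same packet can hit on its next cycle.
        self.entries[dip] = next_hop
        self.entries.move_to_end(dip)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used DIP

cache = DipCache(capacity=2)
cache.update(0x0A000001, "port3")  # filled on a miss, hit thereafter
```

Because entire DIPs rather than prefixes are cached, a hit needs no LMP resolution, which is what permits the single-cycle immediate update.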
The adaptive load balance logic may send the missing packet back to the cache TCAM again, and the packet may hit the cache in the next cycle thanks to the immediate update. On the contrary, in [10] the missing packet is sent directly to the home TCAM because of the slow-update mechanism: a cache miss will always happen if the missing packet is sent to another TCAM's logical cache during the update interval. This seriously increases the burden on the home TCAM during the slow-update interval due to temporal locality.

IV. PERFORMANCE ANALYSIS

A. Lookup Performance

The speedup factor is 1 in the worst case in [9], [10]. The worst case in [9] happens when a bursty stream hits only one TCAM chip during an interval. The worst case in [10] happens when a bursty stream hits one TCAM chip and the matching prefixes are not in the logical cache during an interval; since the logical cache size is very limited and the cache mechanism adopted in [10] is slow-update, in which the cache cannot be updated during the slow-update interval, cache misses happen all the time in the worst case. In our scheme the worst case also happens when a bursty stream hits only one TCAM.

Fig. 6. The Destination IP Address Distribution (panels a)-d): frequency versus number of occurrences for traces of consecutive packets)

The cache size in our architecture is big enough to hold all DIPs during such an interval, and the cache is updated immediately when a cache miss happens, which ensures that the cache TCAM works in parallel with the home TCAM in the worst case, so the worst-case speedup factor of our scheme is 2. The speedup factor of our scheme is apparently n (where n denotes the number of all TCAM chips) in the best case, as in [9], [10], when all TCAMs work in parallel. The average-case speedup factor depends on the characteristics of the incoming traffic. Zheng et al.
[9] use queuing theory to model the lookup engine and assume that the arrival process of the incoming stream is a Poisson process, which is not the case for real Internet traffic [16]. Lin et al. [10] calculated a lower bound on the speedup factor, which can reach n - 1 on the basis that the average hit rate is no less than (n - 2)/(n - 1). Here we do not give a theoretical analysis of the average-case speedup factor, because it is very hard for real traffic; we give an experimental result for the average speedup factor in section 5.

B. Update Efficiency

One TCAM chip is divided into several buckets in [9], [10]. The whole system has to be reconstructed when any bucket on any chip becomes full, even if there is plenty of free space in other buckets on the same chip, which drastically hampers the whole system's performance. If the total number of entries is n and the number of buckets is y, the update cost is O(n/y) for inner-bucket update, and the system has to be reconstructed for inter-bucket update. There is only one bucket on each chip in our system, so the system only needs inner-chip updates before the corresponding chip becomes full; even if the chip is full, we can use buddy adjustment to get free space from its buddy. Reconstruction of our system is very unlikely to happen, which increases the update efficiency of the whole system dramatically. If the total number of entries is n and the number of TCAMs is x,
the update cost is O(n/2x) for inner-chip update and O(2n/x) for inter-chip update on average.

TABLE II. TRACES AND RESULTS: 15 ten-minute traces collected at 6:00, 10:00, 12:00, 13:00, 14:00, 15:00, 17:30, 18:00, 18:30, 19:00, 19:30, 20:00, 21:00, 23:00 and 23:30, with the average speedup factor of each trace

Fig. 7. Trace1-3 Distribution on Each TCAM and Speedup Factors: a) trace1 on chips; b) trace2 on chips; c) trace3 on chips; d) speedup factor distribution of trace1-3
Fig. 8. Trace4-15 Speedup Factors Distribution: a) trace4-6; b) trace7-9; c) trace10-12; d) trace13-15

C. Power Consumption

There are two methods to decrease the number of entries triggered during a lookup operation. One is to store the entries in multiple small chips instead of a single large one. The other is to partition one chip into small blocks and trigger only one of them during a lookup operation, which requires the hardware (TCAM) to support the partition-disable function. Our proposed scheme uses the first method and does not need the partition-disable function. Let n denote the number of chips and x denote the number of chips working in parallel. Assuming the power consumption of an unpartitioned TCAM search is P, the power consumption is Px/n at any time. The maximum power consumption is P when all chips work in parallel, and the lookup speedup factor is apparently n in this case; note that the system power consumption is proportional to the system throughput. The schemes in [9], [10] use the second method to decrease the power consumption. Let s denote the total number of blocks in all chips and t denote the number of blocks working in parallel; the power consumption is then Pt/s at any time in their schemes. The power consumption of our scheme is a little higher compared with the schemes of [9], [10], but we gain much more: 1) the system cost is reduced, as the hardware needs no partition-disable support and the indexing logic is simpler; 2) most importantly, the number of reconstructions is sharply decreased.
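The power model above is simple enough to state directly (symbols follow the text; the numeric values are purely illustrative, not measurements):

```python
def power(P, n, x):
    """Instantaneous power with x of n chips searching in parallel.

    P is the power of an unpartitioned TCAM search; each of the n
    small chips holds 1/n of the entries, so x busy chips draw P*x/n."""
    assert 0 <= x <= n
    return P * x / n

print(power(P=12.0, n=12, x=3))   # 3.0: quarter load, quarter power
print(power(P=12.0, n=12, x=12))  # 12.0: full power, but speedup is also n
```

The same proportionality (busy fraction times P) gives Pt/s for the block-partitioned schemes of [9], [10], with t of s blocks active.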
V. PERFORMANCE EVALUATION

A. Experiment Setup

We conduct a series of experiments and simulations in MATLAB to measure the system's performance and its adaptability to different traffic load distributions. A program simulates the processing of our proposed forwarding engine, driven by real traffic traces and a real route table from a core router in CERNET. The packet traces were collected on the core router of CERNET from 2007/07/07 to 2007/07/17, and the router's route table contains about 200k entries. Following our analysis in Section III-A, we partition the route table into 11 parts using the pre-order splitting method and load them into the TCAM chips as illustrated in Fig. 4. Assume the input queue length is 10 and the arrival rate is stationary for a traffic trace,

to evaluate the speedup factor we approximately divide the whole serving process into discrete subprocesses. The first subprocess handles the first 120 packets of a given trace; the second handles the next 120 packets, composed of the packets remaining from the first subprocess and the successive packets of the trace; and so on. We evaluate the speedup factor of a trace as the average of all its subprocess speedup factors.

B. Results

We use the first day's traces for illustration. Table II presents 15 ten-minute traces collected on 2007/07/07 at different times. Fig. 7 a)-c) show the traffic distribution over the parallel TCAM chips for the first three traces. In the ideal condition each TCAM chip handles 10 of the 120 packets, in which case the lookup speedup factor reaches 12 compared with non-parallel schemes. In reality, as Fig. 7 a)-c) show, after the time the ideal condition would need, 10 packets of trace1, 7 of trace2 and 8 of trace3 remain. In addition, there are 10 × (1 − 98%) = 0.2 cache-miss packets on the cache TCAM chip for each trace, giving totals of 10.2, 7.2 and 8.2 remaining packets. The speedup factors are therefore (120 − 10.2)/10 = 10.98, (120 − 7.2)/10 = 11.28 and (120 − 8.2)/10 = 11.18 for trace1-3. We continue this process until all packets of trace1-3 are served. Fig. 7 d) presents the speedup factor distribution of the first 200 subprocesses of trace1-3, and Fig. 8 a)-d) present the distributions for the remaining traces. The average speedup factor of each trace, given in Table II, is greater than 11 for every trace. Another interesting observation is that the working-hour traces achieve higher speedup factors than the spare-time traces.
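The subprocess speedup computation above can be reproduced with a short script (a sketch; the packet counts are the illustrative figures from the text, and `CACHE_LOAD` assumes the 10 cache-bound packets per subprocess implied by the 10 × (1 − 98%) term):

```python
# Per-subprocess speedup factor of the proposed engine (12 TCAM chips).
# Ideally each chip serves 10 of the 120 packets, giving speedup 12.
BATCH = 120        # packets handled per subprocess
CHIPS = 12         # parallel TCAM chips
CACHE_HIT = 0.98   # cache TCAM hit rate used in the text
CACHE_LOAD = 10    # packets directed to the cache chip per subprocess

def speedup(remaining_on_chips):
    """Speedup once the ideal service time has elapsed.

    remaining_on_chips: packets still queued on the range-partitioned
    chips; cache misses add CACHE_LOAD * (1 - CACHE_HIT) = 0.2 more.
    """
    remaining = remaining_on_chips + CACHE_LOAD * (1 - CACHE_HIT)
    # Packets actually served, scaled by the ideal per-chip quota of 10.
    return (BATCH - remaining) / (BATCH / CHIPS)

# Leftover packets of trace1-3 reported in the text: 10, 7 and 8.
print(round(speedup(10), 2))  # 10.98
print(round(speedup(7), 2))   # 11.28
print(round(speedup(8), 2))   # 11.18
```

With zero leftover packets and no cache misses the formula returns exactly 12, the ideal speedup for 12 chips.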
The likely reason is that more flows are superposed in working hours, which makes the DIP distribution of the traffic more dispersed than in spare time. We have tested traces collected over 10 days, and the results show that the system achieves an average lookup speedup factor greater than 11. The lookup performance of our scheme is about the same as that of the scheme in [10] and superior to that of the scheme in [9]. Our scheme is thus able to maintain good dynamic load balance under bursty Internet traffic.

VI. CONCLUSION

Increasing the lookup throughput and improving the update efficiency while reducing power consumption are the three primary issues addressed in this paper. Our main contributions are as follows: 1) we proposed an efficient buddy update method for the well-organized entries on TCAM chips; 2) we put forward a crossed address range division and shared caching scheme for dynamic traffic balancing; 3) we observed a distinctive feature of the DIP distribution and exploited it in our cache mechanism. The lookup performance of our scheme can be further improved if the cache TCAM chip is faster than the others. The power consumption of our scheme is decreased by reducing the number of TCAM accesses triggered during a lookup operation. Its power consumption may be slightly higher than that of the schemes in [9], [10], but the update efficiency is greatly improved and the cost is lower. Our scheme is flexible and scalable, and can easily be extended to IPv6 address lookup. Our test results show that under real traffic traces, the proposed scheme achieves an average lookup speedup factor greater than 11 with 12 TCAM chips, at the cost of 10% more memory space.

ACKNOWLEDGMENT

The authors would like to express their appreciation to Bin Liu [9], [10], Dong Lin [10] and Hui Wang for their discussions and suggestions. This work is supported by the National Basic Research Program of China under Grant No.
2009CB320505, the National Science and Technology Supporting Plan of China under Grant No. 2008BAH37B05, and the National High-Tech Research and Development Plan of China under Grant Nos. 2008AA01A303 and 2009AA01Z251.

REFERENCES

[1] M. Ruiz-Sanchez, E. Biersack, and W. Dabbous, "Survey and taxonomy of IP address lookup algorithms," IEEE Network, vol. 15, pp. 8-23, Mar.-Apr.
[2] N. Huang, M. Zhao, and J. Pan, "A fast IP routing lookup scheme for gigabit switching routers," in IEEE INFOCOM.
[3] D. Yu and B. S. B. Wei, "Forwarding engine for fast routing lookups and updates," in IEEE GLOBECOM.
[4] CYPRESS. [Online]. Available:
[5] G. Huston. Route table analysis reports. [Online]. Available:
[6] E. Ng and G. Lee, "Eliminating sorting in IP lookup devices using partitioned table," in the 16th IEEE International Conf. on Application-Specific Systems, Architectures and Processors.
[7] K. Jinsoo and K. Junghwan, "An efficient IP lookup architecture with fast update using single-match TCAMs," in WWIC.
[8] R. Panigrahy and S. Sharma, "Reducing TCAM power consumption and increasing throughput," in IEEE HotI 10.
[9] K. Zheng, C. Hu, H. Liu, and B. Liu, "An ultra-high throughput and power efficient TCAM-based IP lookup engine," in IEEE INFOCOM.
[10] D. Lin, Y. Zhang, C. Hu, B. Liu, X. Zhang, and D. Pao, "Route table partitioning and load balancing for parallel searching with TCAMs," in the 21st IEEE International Parallel and Distributed Processing Symposium.
[11] F. Zane, G. Narlikar, and A. Basu, "CoolCAMs: Power-efficient TCAMs for forwarding engines," in IEEE INFOCOM, vol. 1, Mar./Apr. 2003.
[12] W. Shyu, C. Wu, and T. Hou, "Efficiency analyses on routing cache replacement algorithms," in IEEE International Conference on Communications.
[13] T. Chiueh and P. Pradhan, "High-performance IP routing table lookup using CPU caching," in IEEE INFOCOM.
[14] B. Talbot, T. Sherwood, and B. Lin, "IP caching for terabit speed routers," in IEEE GLOBECOM.
[15] M. J. Akhbarizadeh and M. Nourani, "Efficient prefix cache for network processors," in the 12th Annual IEEE Symposium on High Performance Interconnects.
[16] V. Paxson and S. Floyd, "Wide-area traffic: The failure of Poisson modeling," IEEE/ACM Transactions on Networking, vol. 3, 1995.


More information

An Architecture for IPv6 Lookup Using Parallel Index Generation Units

An Architecture for IPv6 Lookup Using Parallel Index Generation Units An Architecture for IPv6 Lookup Using Parallel Index Generation Units Hiroki Nakahara, Tsutomu Sasao, and Munehiro Matsuura Kagoshima University, Japan Kyushu Institute of Technology, Japan Abstract. This

More information

Frequency-based NCQ-aware disk cache algorithm

Frequency-based NCQ-aware disk cache algorithm LETTER IEICE Electronics Express, Vol.11, No.11, 1 7 Frequency-based NCQ-aware disk cache algorithm Young-Jin Kim a) Ajou University, 206, World cup-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do 443-749, Republic

More information

BARP-A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs

BARP-A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs -A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs Pejman Lotfi-Kamran, Masoud Daneshtalab *, Caro Lucas, and Zainalabedin Navabi School of Electrical and Computer Engineering, The

More information

Scalable High Throughput and Power Efficient IP-Lookup on FPGA

Scalable High Throughput and Power Efficient IP-Lookup on FPGA Scalable High Throughput and Power Efficient IP-Lookup on FPGA Hoang Le and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA {hoangle,

More information

A FORWARDING CACHE VLAN PROTOCOL (FCVP) IN WIRELESS NETWORKS

A FORWARDING CACHE VLAN PROTOCOL (FCVP) IN WIRELESS NETWORKS A FORWARDING CACHE VLAN PROTOCOL (FCVP) IN WIRELESS NETWORKS Tzu-Chiang Chiang,, Ching-Hung Yeh, Yueh-Min Huang and Fenglien Lee Department of Engineering Science, National Cheng-Kung University, Taiwan,

More information

Improving Range Query Performance on Historic Web Page Data

Improving Range Query Performance on Historic Web Page Data Improving Range Query Performance on Historic Web Page Data Geng LI Lab of Computer Networks and Distributed Systems, Peking University Beijing, China ligeng@net.pku.edu.cn Bo Peng Lab of Computer Networks

More information

An Enhanced Dynamic Packet Buffer Management

An Enhanced Dynamic Packet Buffer Management An Enhanced Dynamic Packet Buffer Management Vinod Rajan Cypress Southeast Design Center Cypress Semiconductor Cooperation vur@cypress.com Abstract A packet buffer for a protocol processor is a large shared

More information

LOAD SCHEDULING FOR FLOW-BASED PACKET PROCESSING ON MULTI-CORE NETWORK PROCESSORS

LOAD SCHEDULING FOR FLOW-BASED PACKET PROCESSING ON MULTI-CORE NETWORK PROCESSORS LOAD SCHEDULING FOR FLOW-BASED PACKET PROCESSING ON MULTI-CORE NETWORK PROCESSORS Fei He,2, Yaxuan Qi,2, Yibo Xue 2,3 and Jun Li 2,3 Department of Automation, Tsinghua University, Beijing, China 2 Research

More information

On Achieving Fairness and Efficiency in High-Speed Shared Medium Access

On Achieving Fairness and Efficiency in High-Speed Shared Medium Access IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 11, NO. 1, FEBRUARY 2003 111 On Achieving Fairness and Efficiency in High-Speed Shared Medium Access R. Srinivasan, Member, IEEE, and Arun K. Somani, Fellow, IEEE

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

Message Switch. Processor(s) 0* 1 100* 6 1* 2 Forwarding Table

Message Switch. Processor(s) 0* 1 100* 6 1* 2 Forwarding Table Recent Results in Best Matching Prex George Varghese October 16, 2001 Router Model InputLink i 100100 B2 Message Switch B3 OutputLink 6 100100 Processor(s) B1 Prefix Output Link 0* 1 100* 6 1* 2 Forwarding

More information

ANALYSIS OF A DYNAMIC LOAD BALANCING IN MULTIPROCESSOR SYSTEM

ANALYSIS OF A DYNAMIC LOAD BALANCING IN MULTIPROCESSOR SYSTEM International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) ISSN 2249-6831 Vol. 3, Issue 1, Mar 2013, 143-148 TJPRC Pvt. Ltd. ANALYSIS OF A DYNAMIC LOAD BALANCING

More information

Multi-path Routing for Mesh/Torus-Based NoCs

Multi-path Routing for Mesh/Torus-Based NoCs Multi-path Routing for Mesh/Torus-Based NoCs Yaoting Jiao 1, Yulu Yang 1, Ming He 1, Mei Yang 2, and Yingtao Jiang 2 1 College of Information Technology and Science, Nankai University, China 2 Department

More information

Fast IP Routing Lookup with Configurable Processor and Compressed Routing Table

Fast IP Routing Lookup with Configurable Processor and Compressed Routing Table Fast IP Routing Lookup with Configurable Processor and Compressed Routing Table H. Michael Ji, and Ranga Srinivasan Tensilica, Inc. 3255-6 Scott Blvd Santa Clara, CA 95054 Abstract--In this paper we examine

More information

A Proposal for a High Speed Multicast Switch Fabric Design

A Proposal for a High Speed Multicast Switch Fabric Design A Proposal for a High Speed Multicast Switch Fabric Design Cheng Li, R.Venkatesan and H.M.Heys Faculty of Engineering and Applied Science Memorial University of Newfoundland St. John s, NF, Canada AB X

More information

Chapter 24 Congestion Control and Quality of Service 24.1

Chapter 24 Congestion Control and Quality of Service 24.1 Chapter 24 Congestion Control and Quality of Service 24.1 Copyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display. 24-1 DATA TRAFFIC The main focus of congestion control

More information

Incorporation of TCP Proxy Service for improving TCP throughput

Incorporation of TCP Proxy Service for improving TCP throughput Vol. 3, 98 Incorporation of Proxy Service for improving throughput G.P. Bhole and S.A. Patekar Abstract-- slow start algorithm works well for short distance between sending and receiving host. However

More information

Application of TRIE data structure and corresponding associative algorithms for process optimization in GRID environment

Application of TRIE data structure and corresponding associative algorithms for process optimization in GRID environment Application of TRIE data structure and corresponding associative algorithms for process optimization in GRID environment V. V. Kashansky a, I. L. Kaftannikov b South Ural State University (National Research

More information

TOWARDS HIGH-PERFORMANCE NETWORK APPLICATION IDENTIFICATION WITH AGGREGATE-FLOW CACHE

TOWARDS HIGH-PERFORMANCE NETWORK APPLICATION IDENTIFICATION WITH AGGREGATE-FLOW CACHE TOWARDS HIGH-PERFORMANCE NETWORK APPLICATION IDENTIFICATION WITH AGGREGATE-FLOW CACHE Fei He 1, 2, Fan Xiang 1, Yibo Xue 2,3 and Jun Li 2,3 1 Department of Automation, Tsinghua University, Beijing, China

More information

Router s Queue Management

Router s Queue Management Router s Queue Management Manages sharing of (i) buffer space (ii) bandwidth Q1: Which packet to drop when queue is full? Q2: Which packet to send next? FIFO + Drop Tail Keep a single queue Answer to Q1:

More information

Design of a High Speed FPGA-Based Classifier for Efficient Packet Classification

Design of a High Speed FPGA-Based Classifier for Efficient Packet Classification Design of a High Speed FPGA-Based Classifier for Efficient Packet Classification V.S.Pallavi 1, Dr.D.Rukmani Devi 2 PG Scholar 1, Department of ECE, RMK Engineering College, Chennai, Tamil Nadu, India

More information

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER International Journal of Computer Engineering and Applications, Volume VI, Issue II, May 14 www.ijcea.com ISSN 2321 3469 DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER Kiran K C 1, Sunil T D

More information

QUT Digital Repository:

QUT Digital Repository: QUT Digital Repository: http://eprints.qut.edu.au/ Gui, Li and Tian, Yu-Chu and Fidge, Colin J. (2007) Performance Evaluation of IEEE 802.11 Wireless Networks for Real-time Networked Control Systems. In

More information

Switched Network Latency Problems Solved

Switched Network Latency Problems Solved 1 Switched Network Latency Problems Solved A Lightfleet Whitepaper by the Lightfleet Technical Staff Overview The biggest limiter to network performance is the control plane the array of processors and

More information

Exploiting Graphics Processors for High-performance IP Lookup in Software Routers

Exploiting Graphics Processors for High-performance IP Lookup in Software Routers Exploiting Graphics Processors for High-performance IP Lookup in Software Routers Jin Zhao, Xinya Zhang, Xin Wang School of Computer Science Fudan University Shanghai, China Email:{jzhao,06300720198,xinw}@fudan.edu.cn

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors By: Anvesh Polepalli Raj Muchhala Introduction Integrating CPU and GPU into a single chip for performance

More information