ENERGY EFFICIENT INTERNET INFRASTRUCTURE


CHAPTER 1

ENERGY EFFICIENT INTERNET INFRASTRUCTURE

Weirong Jiang, Ph.D. 1, and Viktor K. Prasanna, Ph.D. 2
1 Juniper Networks Inc., Sunnyvale, California
2 University of Southern California, Los Angeles, California

1.1 INTRODUCTION

The Internet is built as a packet-switching network. The core function of the Internet infrastructure, including routers and switches, is to forward packets received from one subnet to another. Packet forwarding is accomplished by using the header information extracted from a packet to look up the forwarding table maintained in the routers/switches. Due to the rapid growth of network traffic, packet forwarding has long been a performance bottleneck in routers/switches.

Figure 1.1 shows the block diagram of a modern Internet router. A router contains two main architectural components: a routing engine and a packet forwarding engine. The routing engine on the control plane processes routing protocols, receives inputs from network administrators, and produces the forwarding table. The packet forwarding engine on the data plane receives packets, matches the header information of each packet against the forwarding table to identify the corresponding action, and applies the action to the packet. The routing engine and the forwarding engine perform their tasks independently, although they constantly communicate through high-throughput links [26].

Figure 1.1 Block diagram of the router system architecture.

The core function of network routers is IP lookup, where the destination IP address of each packet is matched against the entries in the routing table. Each routing entry consists of a prefix and its corresponding next-hop interface. Table 1.1 shows a simple routing table where we assume 8-bit IP addresses. A prefix in the routing table represents a subset of IP addresses that share the same prefix, and the prefix length is denoted by the number following the slash. The nature of IP lookup is longest prefix matching (LPM) [41]. In other words, an IP address may match multiple prefixes, but only the longest

prefix is used to retrieve the next-hop information. For example, a packet whose destination IP address begins with 110 will match the prefixes 110* and 11* in Table 1.1, but 110* becomes the LPM. Therefore, that packet is forwarded to the corresponding next-hop interface, P6.

Table 1.1 Example IP lookup table

Prefix/Length    Next-Hop Interface
     /1              P
     /3              P
     /2              P
     /3              P
     /2              P

Performance Challenges

As the Internet becomes even more pervasive, the performance of the network infrastructure that supports this universal connectivity becomes critical with respect to throughput and power. Traditionally, performance has been achieved by increasing the maximum network throughput to handle bursty traffic. The harsh truth about the power-throughput relationship of today's network infrastructure is that power efficiency is often sacrificed in order to obtain higher throughput through brute-force expansion. A single high-end core router that switches 640 Gbps of full-duplex network traffic can consume over 10 kW of power [9, 25], whereas a high-end service gateway capable of multi-Gbps routing and firewall throughput can take 1-5 kW of power [8, 24]. Both the large amount of total energy and the high power density cause serious problems for the industry and the environment through the operation and maintenance of this network equipment: The historical trend (Figure 1.2) shows that the capacity of backbone routers had doubled every 18 months until 5 years ago. Today, terabit

routers drawing kilowatts of power are at the limit due to the power density. As a result, a thirty-fold shortfall in capacity will be seen by 2015 as compared to the historical trend for single-rack routers [36].

Figure 1.2 Router capacity limited by power [36].

The high power density imposes a strenuous burden on the cooling of the network equipment. According to [44], a high-density hardware component can require 1.3x to 2.3x additional power for its cooling. In other words, every watt that is saved from the critical operation of the network equipment reduces up to 3.3 watts of total power dissipation. The high power and cooling also imply high monetary investments and energy costs. This is especially true for network infrastructure, where the equipment (nodes) often needs to be placed strategically near the center of metropolitan areas. At 500 watts per square foot, it will cost $5,000/ft2, or $250 million in total, to equip a 50,000-square-foot facility as a data center [2]. Some recent investigations [36, 7] show that power dissipation has become the major limiting factor for next-generation routers and predict that expensive liquid cooling may be needed in future routers. Packet forwarding has

been a major performance bottleneck for network infrastructure [13]. Power/energy consumption by forwarding engines has become an increasingly critical concern [47, 14]. Recent analysis by researchers from Bell Labs [36] reveals that almost two thirds of the power dissipation inside a core router is due to IP forwarding engines.

Existing Packet Forwarding Approaches

Software Approaches

The nature of IP lookup is longest prefix matching (LPM). The most common data structure in algorithmic solutions for performing LPM is some form of trie [41]. A trie is a binary tree in which each prefix is represented by a node; the value of the prefix corresponds to the path from the root of the tree to that node. Branching decisions are made based on the consecutive bits in the prefix. If only one bit is used to make a branching decision at a time, the trie is called a uni-bit trie. The prefix set in Figure 1.3 (a) corresponds to the uni-bit trie in Figure 1.3 (b). For example, the prefix 010* corresponds to the path that starts at the root and ends in node P3: first a left turn (0), then a right turn (1), and finally a turn to the left (0). Each trie node contains two fields: the represented prefix and the pointer to the child nodes. By using the optimization called leaf-pushing [46], each node needs only one field: either the pointer to the next-hop address or the pointer to the child nodes. Figure 1.3 (c) shows the leaf-pushed uni-bit trie derived from Figure 1.3 (b). Given a leaf-pushed uni-bit trie, IP lookup is performed by traversing the trie according to the bits of the IP address. When a leaf is reached, the prefix associated with the leaf is the longest matched prefix for that IP address. The corresponding next-hop information of that prefix is then retrieved. The time to look up a uni-bit trie is equal to the prefix length. The use of multiple bits in one scan can increase the search speed.
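The uni-bit trie lookup described above can be sketched in a few lines. This is a minimal illustration, not the chapter's implementation; the prefix set and the next-hop names (P1, P3, P5, P6) are partly assumed, chosen to be consistent with the 110*/11* example discussed earlier.

```python
class TrieNode:
    """Uni-bit trie node: one child per bit, plus an optional next-hop."""
    def __init__(self):
        self.children = [None, None]  # indexed 0 / 1 by the next address bit
        self.next_hop = None          # set if a prefix ends at this node

def insert(root, prefix_bits, next_hop):
    """Insert a prefix (e.g. '010') with its next-hop interface."""
    node = root
    for b in prefix_bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    """Longest prefix match: remember the last prefix seen on the path."""
    node, best = root, None
    for b in addr_bits:
        node = node.children[int(b)]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop      # a longer match overrides a shorter one
    return best

# Hypothetical prefix set over 8-bit addresses, as in the chapter's example
root = TrieNode()
for p, nh in [("0", "P1"), ("010", "P3"), ("11", "P5"), ("110", "P6")]:
    insert(root, p, nh)

print(lookup(root, "11000000"))  # matches 11* and 110*; the LPM is 110* -> P6
print(lookup(root, "01000000"))  # -> P3
```

Scanning one bit per step makes the lookup depth equal to the prefix length, which is exactly what the multi-bit tries discussed next try to reduce.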
A trie that scans multiple bits at a time is called a multi-bit trie. The number of bits scanned at a time is called the stride. Figure 1.3 (d) shows the multi-bit trie for the prefix entries in Figure 1.3 (a). The root node uses a stride of 3, while the node that contains P3 uses a stride

of 2.

Figure 1.3 (a) Prefix set; (b) Uni-bit trie; (c) Leaf-pushed trie; (d) Multi-bit trie.

Multi-bit tries that use a larger stride usually result in a much larger memory requirement, though several optimization schemes have been proposed for memory compression [13, 45]. The well-known tree bitmap algorithm [13] uses a pair of bitmaps for each node in a multi-bit trie. One bitmap represents the children that are actually present, and the other represents the next-hop information associated with the given node. Children of a node are stored in consecutive memory locations, which allows each node to use just a single child pointer. Similarly, another single pointer is used to reference the next-hop information associated with a node. This representation allows every node in the multi-bit trie to occupy a small amount of memory.

Hardware Approaches

With the advances in optical networking technology, link rates in Internet infrastructure are being pushed from OC-768 (40 Gbps) to even higher rates [50]. Such high rates demand that IP lookup

in routers must be performed in hardware. For instance, 40 Gbps links require one lookup every 8 ns for a minimum-size (40-byte) packet. Such throughput is impossible using existing software-based solutions [41, 16]. Most hardware-based solutions for high-speed IP lookup fall into two main categories: TCAM (ternary content addressable memory)-based and DRAM/SRAM (dynamic/static random access memory)-based solutions. Although TCAM-based engines can retrieve IP lookup results in just one clock cycle, their throughput is limited by the relatively low speed of TCAMs. They are expensive and offer little flexibility for adapting to new addressing and routing protocols [48]. As shown in Table 1.2, SRAM outperforms TCAM with respect to speed, density, and power consumption. However, traditional SRAM-based solutions, most of which can be regarded as some form of tree traversal, need multiple clock cycles to complete a lookup. For example, the trie [41], a tree-like data structure representing a collection of prefixes, is widely used in SRAM-based solutions. It needs multiple memory accesses to search a trie to find the longest matched prefix for an IP packet.

Table 1.2 Comparison of TCAM and SRAM technologies

                                          TCAM (18 Mb chip)   SRAM (18 Mb chip)
Maximum clock rate (MHz)                  250 [18]            450 [11, 43]
Cell size (# of transistors per bit) [1]  16                  6
Power consumption (Watts)                     [54]            0.1 [5]

1.2 SRAM-BASED PIPELINED IP LOOKUP ARCHITECTURES: ALTERNATIVE TO TCAMS

Several researchers have explored pipelining to significantly improve the throughput of SRAM-based IP lookup engines. Taking trie-based solutions as an example, a simple pipelining approach is to map each trie level

onto a pipeline stage with its own memory and processing logic. One IP lookup can be performed every clock cycle. However, this approach results in an unbalanced trie node distribution over the pipeline stages. Memory imbalance has been identified as a dominant issue for pipelined architectures [15, 4]. In an unbalanced pipeline, the fattest stage, which stores the largest number of trie nodes, becomes a bottleneck. It adversely affects the overall performance of the pipeline for the following reasons. First, it needs more time to access the larger local memory, which reduces the global clock rate. Second, a fat stage results in many updates, due to the proportional relationship between the number of updates and the number of trie nodes stored in that stage. Particularly during the update process caused by intensive route insertion, the fattest stage can also suffer memory overflow. Furthermore, since it is unclear at hardware design time which stage will be the fattest, memory of the maximum size must be allocated for each stage, which results in memory wastage. Basu et al. [4] and Kim et al. [29] both reduce the memory imbalance by using variable strides to minimize the largest trie level. However, even with their schemes, the memory sizes of different stages can vary widely. As an improvement upon [29], Lu et al. [34] propose a tree-packing heuristic to further balance the memory, but it does not solve the fundamental problem of how to retrieve a node's descendants that are not allocated in the following stage. Furthermore, a variable-stride multi-bit trie is difficult to implement in hardware, especially if incremental updating is needed [4]. Baboescu et al. [3] propose a Ring pipeline architecture for trie-based IP lookup. The memory stages are configured in a circular, multi-point access pipeline, so that lookups can be initiated at any stage.
The trie is split into many small subtries of equal size. These subtries are then mapped to different stages to create a balanced pipeline. Some subtries have to wrap around if their roots are mapped to the last several stages. Though all IP packets enter the pipeline from the first stage, their lookup processes may be activated at different stages. Hence, all the IP lookup packets must traverse the pipeline

twice to complete the trie traversal. The throughput is thus 0.5 lookups per clock cycle. Kumar et al. [31] extend the circular pipeline with a new architecture called the Circular, Adaptive and Monotonic Pipeline (CAMP). It has multiple entrance and exit points, so that the throughput can be increased at the cost of output disorder and delay variation. It employs several request queues to manage access conflicts between a new request and the one from the preceding stage. It can achieve a worst-case throughput of 0.8 lookups per clock cycle, while maintaining balanced memory across pipeline stages. Due to their non-linear structure, neither the Ring pipeline nor CAMP can maintain a worst-case throughput of one lookup per clock cycle. Also, neither of them properly supports the write bubble proposed in [4] for incremental route updates. Jiang et al. [19, 23] propose a fine-grained node-to-stage mapping scheme for linear pipeline architectures. It is based on a heuristic that allows any two nodes on the same level of a trie to be mapped onto different stages of a pipeline. Balanced memory distribution across pipeline stages is achieved, while a high throughput of one packet per clock cycle is sustained. The linear pipeline architecture supports incremental route updates without disrupting ongoing IP lookup operations. Figure 1.4 shows the mapping result for the uni-bit trie in Figure 1.3 (b) using the fine-grained mapping scheme. To allow two nodes on the same trie level to be mapped to different stages, each node stored in the local memory of a pipeline stage has two fields. One is the memory address of its child node in the pipeline stage where the child node is stored. The other is the distance to that pipeline stage. As a packet passes through the pipeline, its distance value is decremented by 1 at each stage.
When the distance value becomes 0, the child node's address is used to access the memory in that stage.
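The distance-based traversal described above can be simulated in software. The sketch below uses a made-up node layout and a hypothetical mapping of a small prefix set (0* -> P1, 010* -> P3, 11* -> P5, 110* -> P6); it only illustrates the idea that each node carries (distance, address) pointers and that a stage's memory is touched only when the packet's distance reaches zero.

```python
from collections import namedtuple

# Node fields: next_hop (or None) and children, a dict mapping the next
# address bit to (distance, address) -- distance counts stages ahead,
# address indexes that stage's local memory.
Node = namedtuple("Node", "next_hop children")

def pipeline_lookup(stages, addr_bits):
    """Walk the stages once, left to right, as a packet would."""
    remaining, address = 0, 0        # the root is assumed at stage 0, address 0
    best, depth = None, 0
    for stage in stages:
        if address is None:
            continue                 # lookup finished; packet drifts to the exit
        if remaining > 0:
            remaining -= 1           # pass through without touching this memory
            continue
        node = stage[address]        # remaining == 0: access this stage's SRAM
        if node.next_hop is not None:
            best = node.next_hop     # a longer match overrides a shorter one
        child = node.children.get(addr_bits[depth]) if depth < len(addr_bits) else None
        depth += 1
        if child is None:
            address = None
        else:
            d, address = child
            remaining = d - 1        # the child lives d stages ahead of this one
    return best

# Hypothetical mapping; node "1" is deliberately placed two stages ahead of
# the root to show that siblings need not sit in the same stage.
stages = [
    {0: Node(None, {"0": (1, 0), "1": (2, 1)})},                   # root
    {0: Node("P1", {"1": (1, 0)})},                                # "0"
    {0: Node(None, {"0": (1, 0)}), 1: Node(None, {"1": (1, 1)})},  # "01", "1"
    {0: Node("P3", {}), 1: Node("P5", {"0": (1, 0)})},             # "010", "11"
    {0: Node("P6", {})},                                           # "110"
]
print(pipeline_lookup(stages, "11000000"))  # -> P6
print(pipeline_lookup(stages, "01000000"))  # -> P3
```

Note how every packet traverses all stages exactly once, so one lookup can complete per clock cycle regardless of where its nodes are mapped.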

Figure 1.4 Mapping the uni-bit trie (shown in Figure 1.3 (b)) onto the linear pipeline [19, 23].

1.3 DATA STRUCTURE OPTIMIZATION FOR POWER EFFICIENCY

Though SRAM-based pipeline architectures have been proposed as a promising alternative to power-hungry TCAMs for IP lookup engines in next-generation routers [6, 23], they may still suffer from high power/energy consumption, due to the large number of memory accesses for each IP lookup [52]. The overall power consumption for each IP lookup in SRAM-based engines can be expressed as in Equation 1.1:

Power_overall = Σ_{i=1}^{H} [P_m(S_i) + P_l(i)]    (1.1)

Here, H denotes the number of memory accesses, P_m(·) the power dissipation of a memory access (which usually has a positive correlation with the memory size), S_i the size of the ith memory being accessed, and P_l(i) the power consumption of the logic associated with the ith memory access. Since the logic dissipates much less power than the memories in these memory-dominant architectures [31, 39], the main focus of our work is on reducing the power consumption of the memory accesses. Note that the power consumption of a single memory is affected by many other factors, such as the fabrication technology and sub-bank organization, which are beyond the scope of our work. Since the energy consumption for each memory access is the product of the power consumption and the memory access time, the energy consumption per IP lookup equals its power consumption times the clock period. In this

chapter, power, if not specified, refers to the power/energy consumption per IP lookup.

Problem Formulation

Little work has been done on data structure optimization for power-efficient SRAM-based IP lookup engines. In this section we focus on fixed-stride multi-bit tries, where all nodes at the same level have the same stride. Fixed-stride multi-bit tries are attractive for hardware implementation due to their ease of route update [13]. We use the following notation. Let W denote the maximum prefix length; W = 32 for IPv4. Let S = {s_0, s_1, ..., s_{k-1}} denote the sequence of strides for building a k-level multi-bit trie, and let |S| denote the number of strides in S (|S| = k), with Σ_{i=0}^{k-1} s_i = W. Considering the hardware implementation for tree-bitmap-coded multi-bit tries, we cap the strides at s_i ≤ B_s, i = 1, ..., k-1, where B_s is a predefined parameter called the stride bound.

Non-Pipelined and Pipelined Engines

An SRAM-based non-pipelined IP lookup engine stores the entire trie in a single memory. Any IP lookup may need to access the memory multiple times. Hence, the worst-case power consumption of an SRAM-based non-pipelined IP lookup engine can be modeled by Equation (1.2), where Power_memory and Power_logic denote the power consumption of the memory and of the logic, respectively.

Power = (Power_memory + Power_logic) × k    (1.2)

The logic dissipates much less power than the memories in these memory-intensive architectures [31, 17, 22]. For example, [22] shows that the memory dissipates almost an order of magnitude more power than the logic in an FPGA implementation of a pipelined IP lookup engine. Thus, we do not consider the power consumption of the logic. The optimal stride problem can be formulated as:

min_{k=1,2,...,W} min_{S(k)} P_m(M(S(k))) × k    (1.3)

where M(S) denotes the memory requirement of the multi-bit trie built using S, and P_m(M) is the power function of an SRAM of size M. For an SRAM-based pipelined IP lookup engine, the worst-case power consumption can be modeled by Equation (1.4), where H denotes the pipeline depth, i.e., the number of pipeline stages, and Power_memory(i) and Power_logic(i) denote the power consumption of the memory and of the logic in the ith stage, respectively.

Power = Σ_{i=1}^{H} [Power_memory(i) + Power_logic(i)]    (1.4)

As for the non-pipelined engine, we omit the power consumption of the logic. Also, assuming that the memory distribution across the pipeline stages is balanced, the optimal stride problem can be formulated as:

min_{S} { P_m(M(S)/H) × max(|S|, H) }    (1.5)

where M(S) denotes the memory requirement of the multi-bit trie built using S, and P_m(M) is the power function of an SRAM of size M. The number of memory accesses is determined by |S| and H. When H < |S|, multiple clock cycles are needed to access a stage. To achieve high throughput, we let H ≥ |S|. Since |S| = k ≤ W, we can rewrite (1.5) as:

min_{k=1,2,...,W} min_{S(k)} P_m(M(S(k))/H) × H    (1.6)

To solve (1.3) and (1.6), we can first fix k and find the optimal S(k) so that the power consumption is minimized for the given k. Then, we compare the power consumption for different k to obtain the overall optimal S.

Power Function of SRAM

Before we solve the above optimization problems, we need to determine the power function of the SRAM with respect to its size M: P_m(M). There is some published work on comprehensive power models of SRAM [12, 32, 49]. But these detailed white-box models do not show the direct relationship between the power consumption and the memory

size. We use the CACTI tool [49] to evaluate both the dynamic and the static power consumption of SRAMs of different sizes and then obtain the function parameters through curve fitting (black-box modeling). According to [12, 32], when the word width is constant, both the high-level dynamic and static power consumption of SRAMs can be approximately represented in the form:

P(M) = A · M^B    (1.7)

where M is the memory size, and A and B are parameters whose values are different for dynamic and static power. We vary the SRAM size from 256 bytes to 8 Mbytes while keeping the word width at 8 bytes, and obtain the power consumption using the CACTI tool [49]. After curve fitting, we obtain A_dynamic = ..., B_dynamic = 0.50, A_static = ..., and B_static = .... The results from CACTI and curve fitting are both shown in Figure 1.5.

Figure 1.5 Power vs. SRAM size.

Hence P_m(M) = A_dynamic · M^{B_dynamic} + A_static · M^{B_static} ≈ 10^{-6} (207 · M^{0.5} + ... · M). Then, (1.3) and (1.6) become (1.8) and (1.9), respectively:

min_{k=1,2,...,W} min_{S(k)} (207 · M(S(k))^{0.5} + ... · M(S(k))) × k    (1.8)

min_{k=1,2,...,W} min_{S(k)} (207 · M(S(k))^{0.5} · H^{0.5} + ... · M(S(k)))    (1.9)
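The black-box fitting step can be illustrated with a few lines of code. Since the exact CACTI numbers are not reproduced in this text, the sketch below generates synthetic data from an exact power law using the dynamic-power parameters reported above (A = 207e-6, B = 0.5) and recovers them by least squares in log-log space; with real CACTI measurements the same routine would return the fitted A and B.

```python
import math

def fit_power_law(sizes, powers):
    """Fit P(M) = A * M**B by linear least squares in log-log space."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(p) for p in powers]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    B = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))   # slope = exponent B
    A = math.exp(ybar - B * xbar)              # intercept gives coefficient A
    return A, B

# Synthetic stand-in for the CACTI measurements: exact A = 207e-6, B = 0.5
sizes = [256 * 4 ** i for i in range(8)]       # 256 B ... 4 MB
powers = [207e-6 * m ** 0.5 for m in sizes]
A, B = fit_power_law(sizes, powers)
print(round(A, 6), round(B, 2))                # recovers 0.000207 and 0.5
```

Because the data here lie exactly on a power law, the regression recovers the parameters exactly; CACTI output would fit only approximately.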

For a given k, when M(S(k)) is minimized, the power consumption is also minimized. Thus, the above problems reduce to finding the optimal strides that minimize the memory requirement.

Special Case: Uniform Stride

In the original tree bitmap paper [13], the authors suggest using the same stride for all the nodes except the root node. We call such a special fixed-stride multi-bit trie a multi-bit trie with uniform stride. The stride used by the root, s_0, is called the initial stride. Given k, we can find the optimal S by exhaustive search over different initial strides. In each iteration, s_i = ⌈(W - s_0)/(k - 1)⌉ for i = 1, 2, ..., k-1, with the last stride absorbing any remainder.

Dynamic Programming

Srinivasan and Varghese [46] developed a dynamic-programming-based solution to minimize the memory requirement of a k-level multi-bit trie. Sahni and Kim [42] made further improvements to reduce the complexity of the algorithms. However, those algorithms focused on the naive implementation of multi-bit tries, without considering the tree bitmap coding technique [13] for compressing the memory requirement of multi-bit tries. Similar to [46] and [42], we use the following notation. O denotes the uni-bit trie for the given set of prefixes. nnode(i) denotes the number of nodes at the ith level of O. nPrefix(i, j) denotes the total number of prefixes contained between the ith and the jth levels of O. T(j, r), r ≤ j+1, denotes the cost (the memory requirement) of the best way to cover levels 0 through j of O using exactly r expansion levels.

The dynamic programming recurrence for T is:

T(j, r) = min_{m = max(r-2, j-B_s), ..., j-1} { T(m, r-1) + nnode(m+1) · 2^{B_s} · 2/32 + nnode(j+1) + nPrefix(m+1, j) }    (1.10)

T(j, 1) = 2^{j+1}    (1.11)

Note that all the strides except the initial stride are capped by the stride bound (B_s). In hardware implementation, the length of the bitmaps is determined by B_s. Algorithm FixedStride(W, k), as shown in Figure 1.6, computes T(W-1, k), which is the minimum memory requirement to build a k-level tree-bitmap-coded multi-bit trie. The complexity of Algorithm FixedStride(W, k) is O(k · W · B_s). After obtaining T(W-1, k), we can follow the track of the corresponding M(·, ·) to find the optimal S in O(k) time.

Performance Evaluation

We used seventeen real-life backbone routing tables from the Routing Information Service (RIS) [40]. Their characteristics are shown in Table 1.3. Note that the routing tables rrc08 and rrc09 are much smaller than the others, since the collection of these two data sets ended in September 2004 and February 2004, respectively [40]. We conducted experiments for both non-pipelined and pipelined architectures. We evaluated the impact of different architecture parameters on the power-optimal design of the data structure. The architecture parameters include the stride type, the stride bound, and the pipeline depth. For fixed-stride tree-bitmap-coded multi-bit tries, two stride types are considered. The first uses uniform strides, as described in "Special Case: Uniform Stride". The second uses optimal strides capped by B_s, as discussed in "Dynamic Programming".

Results for Non-Pipelined Architecture

First, we set B_s = 4 and examined the results for the non-pipelined architecture using uniform and op-

Input: O
Output: T(W-1, k), S(k)
1: // Compute T(W-1, k)
2: for j = 0 to W-1 do
3:   T(j, 1) = 2^{j+1}
4: end for
5: for r = 2 to k do
6:   for j = r-1 to W-1 do
7:     mincost = MaxValue
8:     for m = max(r-2, j-B_s) to j-1 do
9:       cost = T(m, r-1) + nnode(m+1) · 2^{B_s} · 2/32 + nnode(j+1) + nPrefix(m+1, j)
10:      if cost < mincost then
11:        mincost = cost
12:        T(j, r) = mincost
13:        M(j, r) = m
14:      end if
15:    end for
16:  end for
17: end for
18: // Compute S(k) = {s_i}, i = 0, 1, ..., k-1
19: m = W-1
20: for r = k down to 2 do
21:   s_{r-1} = m - M(m, r)
22:   m = M(m, r)
23: end for
24: s_0 = m + 1

Figure 1.6 Algorithm FixedStride(W, k).
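A direct Python transcription of the dynamic program might look as follows. The per-node cost term nnode(m+1) · 2^{B_s} · 2/32 follows recurrence (1.10) as reconstructed here, so treat it as an assumption; nnode must cover levels 0 through W, nprefix(i, j) is supplied by the caller, and the tiny trie profile in the usage example is made up.

```python
def fixed_stride(nnode, nprefix, W, k, Bs):
    """Dynamic program for the minimum-memory k-level fixed-stride trie.

    nnode[i]      -- number of uni-bit-trie nodes at level i (length >= W+1)
    nprefix(i, j) -- number of prefixes on levels i..j
    Returns (T[W-1][k], stride sequence).
    """
    INF = float("inf")
    T = [[INF] * (k + 1) for _ in range(W)]
    M = [[None] * (k + 1) for _ in range(W)]
    for j in range(W):                       # base case: one expansion level
        T[j][1] = 2 ** (j + 1)
    for r in range(2, k + 1):
        for j in range(r - 1, W):
            for m in range(max(r - 2, j - Bs), j):
                cost = (T[m][r - 1]
                        + nnode[m + 1] * 2 ** Bs * 2 / 32  # assumed bitmap cost
                        + nnode[j + 1]
                        + nprefix(m + 1, j))
                if cost < T[j][r]:
                    T[j][r] = cost
                    M[j][r] = m              # remember the best split point
    # Recover the stride sequence from the recorded split points
    strides, m = [], W - 1
    for r in range(k, 1, -1):
        strides.append(m - M[m][r])
        m = M[m][r]
    strides.append(m + 1)                    # initial stride covers levels 0..m
    return T[W - 1][k], list(reversed(strides))

# Made-up profile of a tiny trie: W = 3, one root, growing levels
cost, strides = fixed_stride([1, 2, 3, 4], lambda i, j: 1, 3, 2, 2)
print(cost, strides)   # -> 7.5 [1, 2]
```

Running it for each k = 1, ..., W and plugging each minimum memory into the fitted power function yields the power-optimal stride sequence, as in the evaluation below.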

Table 1.3 Representative routing tables (snapshot on 2009/04/01)

Routing table   # of prefixes   # of prefixes w/ length < 16
rrc00                           (0.79%)
rrc01                           (0.83%)
rrc02                           (0.78%)
rrc03                           (0.83%)
rrc04                           (0.81%)
rrc05                           (0.84%)
rrc06                           (0.82%)
rrc07                           (0.84%)
rrc08                           (0.59%)
rrc09                           (0.75%)
rrc10                           (0.83%)
rrc11                           (0.83%)
rrc12                           (0.83%)
rrc13                           (0.81%)
rrc14                           (0.83%)
rrc15                           (0.79%)
rrc16                           (0.82%)

timal strides. The results are shown in Figure 1.7. In both cases, the power was minimized when k = 5 and S = {16, 4, 4, 4, 4}.

Figure 1.7 Power results of the non-pipelined architecture using (a) the uniform stride and (b) the optimal stride.

Figure 1.8 Power results of the non-pipelined architecture using (a) B_s = 2 and (b) B_s = 6.

Then we varied the stride bound (B_s). Figure 1.8 shows the results using two different stride bounds, B_s = 2 and B_s = 6, for the architecture that uses the optimal stride. For B_s = 2, the power was minimized when k = 9

and S = {16, 2, 2, 2, 2, 2, 2, 2, 2}. For B_s = 6, the minimal power was achieved when k = 4 and S = {17, 5, 5, 5}.

Results for Pipelined Architecture

Figure 1.9 shows the power consumption of the pipelined architecture using the two stride types. The pipeline depth (h) was set equal to k. The stride bound B_s was 4. Both cases achieved the optimal power performance when k = 6 and S = {13, 4, 4, 4, 4, 3}.

Figure 1.9 Power results of the pipelined architecture using (a) the uniform stride and (b) the optimal stride.

Figure 1.10 shows the results using different stride bounds for the optimal stride. We set h = k. For B_s = 2, the power was minimized when k = 9 and S = {16, 2, 2, 2, 2, 2, 2, 2, 2}. For B_s = 6, the minimal power was achieved when k = 5 and S = {16, 4, 4, 4, 4}. We also conducted experiments using different pipeline depths. Both cases achieved the minimal power consumption when k = 6 and S = {13, 4, 4, 4, 4, 3}, the same as the results for h = k (Figure 1.9(b)). This means that the pipeline depth has little impact on determining the optimal strides for pipelined architectures.

Figure 1.10 Power results of the pipelined architecture using (a) B_s = 2 and (b) B_s = 6.

1.4 ARCHITECTURAL OPTIMIZATION TO REDUCE DYNAMIC POWER DISSIPATION

This section exploits several characteristics of Internet traffic and of the pipeline architecture to reduce the dynamic power consumption of SRAM-based IP forwarding engines. First, as observed in [20], Internet traffic contains a large amount of locality, where most packets belong to a few flows. By caching the recently forwarded IP addresses, the number of memory accesses can be reduced, lowering the power consumption. Unlike previous caching schemes, most of which attach an external cache to the main forwarding engine, we integrate the caching function into the pipeline architecture itself. As a result, we do away with complicated cache replacement hardware and eliminate the power consumption of the hot cache [53]. Second, since the traffic rate varies over time, we freeze the logic when no packet is input. We propose a local clocking scheme where each stage is driven by an independent clock and is activated only under certain conditions. The local clocking scheme can also improve the caching performance. Third, we note that different packets may access different stages of the pipeline, which leads

to varying access frequencies across the stages. Thus we propose a fine-grained memory enabling scheme that puts the memory in a stage to sleep when the incoming packet is not accessing it. Our simulation results show that the proposed schemes can reduce the power consumption by up to fifteen-fold. We prototyped our design on a commercial field-programmable gate array (FPGA) device and show that the logic usage is low while the backbone throughput requirement (40 Gbps) is met.

Analysis and Motivation

We obtained four backbone Internet traffic traces from the Cooperative Association for Internet Data Analysis (CAIDA) [10]. The trace information is shown in Table 1.4, where the numbers in parentheses are the ratio of the number of unique destination IP addresses to the total number of packets in each trace.

Table 1.4 Real-life IP header traces

Trace               Date   # of packets   # of unique IPs
equinix-chicago-a                         (6.93%)
equinix-chicago-b                         (6.48%)
equinix-sanjose-a                         (6.73%)
equinix-sanjose-b                         (5.24%)

Traffic Locality. According to Table 1.4, regardless of the length of the packet trace, the number of unique destination IP addresses is always much smaller than the number of packets. These results coincide with those of previous work on Internet traffic characterization [33]. Due to TCP burstiness, some destination IP addresses are accessed very frequently within a short time span. Hence, caching has been used effectively to exploit such traffic locality to either

improve the IP forwarding speed [33] or help balance the load among multiple forwarding engines [20]. This chapter employs the caching scheme to reduce the number of memory accesses so that the power consumption can be lowered.

Traffic Rate Variation. We analyze the traffic rate in terms of the number of packets at different times. The results for the four traces are shown in Figure 1.11, where the X axis indicates the time intervals and the Y axis the number of packets within each time interval. As observed in other papers [28], the traffic rate varies over time. Although the router capacity is designed for the maximum traffic rate, power consumption of the IP forwarding engine can be reduced by exploiting such real-life traffic rate variation.

Figure 1.11 Traffic rate variation over time.

Access Frequency on Different Stages. The unique feature of the SRAM-based pipelined IP forwarding engine is that different stages contain different sets of trie nodes. Given various input traffic, the access frequency of different stages can vary significantly. For example, we used a backbone routing table from the Routing Information Service (RIS) [40] to generate a trie, mapped the trie onto a 25-stage pipeline, and measured the total number of memory accesses

on each stage for the four input traffic traces. The results are shown in Figure 1.12, where the access frequency of each stage is calculated by dividing the number of memory accesses on that stage by the number on the first stage. The first stage is always accessed by all packets, while the last few stages are seldom accessed. Based on this observation, we should disable the memory access in a stage whenever the packet is not accessing the memory in that stage.

Figure 1.12 Access frequency on each stage (Y axis: access frequency; X axis: stage ID; traces: equinix-chicago-a/b, equinix-sanjose-a/b)

Architecture-Specific Techniques

We propose a caching-enabled SRAM-based pipeline architecture for power-efficient IP forwarding, as shown in Figure 1.13. Let H denote the pipeline depth, i.e., the number of stages in the pipeline. These H stages store the mapped trie nodes. Every time the architecture receives a packet, the incoming packet compares its destination IP address with those of the packets already in the pipeline. A match is considered a cache hit, even though the matched packet has not yet retrieved its next-hop information. To preserve the packet order, a packet with a cache hit still goes

through the pipeline. However, no memory access is needed for this packet, so the power consumed for this packet is reduced.

Figure 1.13 Pipeline with (a) inherent caching and (b) local clocking

Inherent Caching. Most existing caching schemes add an external cache to the forwarding engine. However, the cache itself can be power-intensive [53] and also needs extra logic to support cache replacement. The relatively long pipeline delay can also result in low cache hit rates in traditional caching schemes [20]. Our architecture implements the caching function without appending extra caches. As shown in Figure 1.13 (a), the pipeline itself acts as a fully associative cache, where the packets in all the stages are matched against the arriving packet. If the arriving packet (denoted Pkt_new) matches a packet (denoted Pkt_exist) that is already in the pipeline, Pkt_new has a cache hit even though Pkt_exist has not yet retrieved its next-hop information. Pkt_new then goes through the pipeline with its cache hit signal set to 1. On the other hand, Pkt_new does not obtain the next-hop information until Pkt_exist exits the pipeline. As shown in Figure 1.13 (a), the packet exiting the pipeline forwards its IP address and its retrieved next-hop information to all the previous stages. The packets in the previous stages compare their destination addresses with the forwarded IP address. The packet matching the forwarded IP address takes the forwarded next-hop information as its own and carries it along when traversing the rest of the pipeline.
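The inherent-caching idea can be sketched behaviorally in a few lines of Python. This is an illustrative model only, not the authors' hardware design: the packet fields, the one-packet-per-cycle advance, and the flat charge of one SRAM read per stage on a miss are all simplifying assumptions (in the real engine a packet stops reading memory once its next hop is found).

```python
# Behavioral sketch: the pipeline itself is searched like a fully
# associative cache, so a newly arriving packet whose destination IP
# matches any in-flight packet is marked as a hit and skips all SRAM reads.

H = 8  # pipeline depth (number of stages); value chosen for illustration

def run(trace):
    """Advance the pipeline one packet per cycle; return (accesses, hits)."""
    pipeline = [None] * H          # slot i models the packet in stage i
    accesses = hits = 0
    for ip in trace:
        in_flight = {p["ip"] for p in pipeline if p is not None}
        hit = ip in in_flight      # associative match against all stages
        if hit:
            hits += 1
        pipeline.pop()             # packet in the last stage exits
        pipeline.insert(0, {"ip": ip, "hit": hit})
        # A packet with a cache hit still traverses the stages (preserving
        # packet order) but performs no memory reads; only misses pay for
        # SRAM accesses (simplified here as one read in every stage).
        accesses += 0 if hit else H
    return accesses, hits

# Bursty traffic (repeated destinations close together) hits often:
print(run([1, 2, 1, 1, 3, 2, 2, 1]))  # -> (24, 5): 3 misses, 5 hits
```

With all-distinct destinations the hit count drops to zero and every packet pays the full per-stage access cost, which is why this scheme leans on the traffic locality measured in Table 1.4.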

Local Clocking. Most existing pipelined IP lookup engines are driven by a global clock. The logic in a stage is active even when there is no packet to be processed, which results in unnecessary power consumption. Furthermore, since the pipeline keeps forwarding packets from one stage to the next at the highest clock frequency, the pipeline contains few packets when the traffic rate is low. Because the pipeline serves as a cache that is dynamic and sensitive to the input traffic, few packets in the pipeline means a small number of cached entries, which results in a low cache hit rate. To address this issue, we propose a local clocking scheme in which each stage is driven by an individual clock. Only one constraint must be met to prevent any packet loss:

Constraint 1: If the clock of the previous stage is active and there is a packet in the current stage, the clock of the current stage must be active.

Hence, we design the local clocking as shown in Figure 1.13 (b). The clock of a stage is not activated until the stage contains a valid packet and its preceding stage is forwarding data to the current stage. To prevent clock skew, some delay logic is added in the path of the clock signal of the previous stage. In a real implementation, we do not use an AND gate, which may produce glitches; instead, we use the clock buffer primitives provided by the Xilinx design tools [51].

Fine-Grained Memory Enabling. As discussed earlier, the access frequency to different stages within the pipeline varies. Current pipelined IP forwarding engines keep all memories active for all packets, which results in unnecessary power consumption. Our fine-grained memory enabling scheme gates the clock signal with the read enable signal for the memory in each stage. The read enable signal becomes active only when the packet needs to access the memory in the current stage.
In other words, the read enable signal will remain inactive in any of the following four cases: no packet is arriving, the distance value of the arriving packet is larger than 0, the packet has already retrieved its next-hop information, or the cache hit signal carried by the packet is set to 1.
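The four conditions above amount to a small predicate on a packet's state. The sketch below is an assumed encoding (field names are illustrative; per the text, the "distance" field counts how many stages remain before the packet reaches its next trie node), not the actual enable logic on the FPGA.

```python
# Sketch of the fine-grained memory enabling rule: the stage's SRAM
# read-enable (which gates its clock) stays inactive in the four cases
# listed in the text. Packet field names here are illustrative.

def read_enable(packet):
    """Return True only if this stage must read its SRAM for `packet`."""
    if packet is None:                  # case 1: no packet is arriving
        return False
    if packet["distance"] > 0:          # case 2: target node is in a later stage
        return False
    if packet["next_hop"] is not None:  # case 3: next hop already retrieved
        return False
    if packet["cache_hit"]:             # case 4: resolved by the inherent cache
        return False
    return True

print(read_enable({"distance": 0, "next_hop": None, "cache_hit": False}))  # True
print(read_enable(None))                                                   # False
```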

Performance Evaluation

We prototyped our design (denoted "Proposed") and a baseline pipeline (denoted "Baseline") that did not integrate the proposed schemes on FPGA, using the Xilinx ISE 10.1 development tools. The target device was a Xilinx Virtex-5 XC5VFX200T with -2 speed grade. Table 1.5 shows the post place-and-route results, where "Both" denotes both designs.(1) Although our design used more logic resources than the baseline, it still consumed only a small fraction of the overall on-chip logic resources. Both designs achieved a clock frequency of 125 MHz while using the same amount of Block RAMs. Such a clock frequency results in a throughput of 40 Gbps for minimum-size (40-byte) packets, which meets the current backbone network rate.

Table 1.5 Resource utilization

  Resource               Design     Used    Available    Utilization
  Number of Slices       Baseline
  Number of Slices       Proposed
  Number of bonded IOBs  Both
  Number of Block RAMs   Both

The dynamic power consumption of a pipelined IP lookup engine can be modeled as Equation (1.12), where p denotes a packet to be looked up, H the pipeline depth, N(p) the number of packets, and Power_memory(i, p) and Power_logic(i, p) the power consumed by the memory and by the logic in the i-th stage for p, respectively.

(1) Due to the limited size of the on-chip memory, both designs supported 70K prefixes, which is about 1/4 of the current largest backbone routing table. However, our architecture can be extended by using external SRAMs.

Power = ( Σ_p Σ_{i=1}^{H} [ Power_memory(i, p) + Power_logic(i, p) ] ) / N(p)    (1.12)

We profiled the power consumption of the memory and of the logic based on our FPGA implementation results. Using the XPower Analyzer tool provided by Xilinx, we obtained the power consumption of the baseline and of our design, as shown in Figure 1.14. As expected, the power consumed by memory dominated the overall power dissipation of the pipelined IP lookup engine.

Figure 1.14 Profiling of dynamic power consumption in a pipelined IP lookup engine

Based on the profiled power data of the architecture, we developed a cycle-accurate simulator for our pipelined IP lookup engine. We conducted experiments using the four real-life backbone traffic traces given in Table 1.4 and evaluated the overall power consumption. First, we examined the impact of the fine-grained memory enabling scheme. We disabled both inherent caching and local clocking, and then ran the simulation under two conditions: (1) without fine-grained memory enabling (denoted "wo/ FME") and (2) with fine-grained memory enabling (denoted "w/ FME"). Figure 1.15 compares the results, where the results of (2) are taken as the baseline and the results of (1) are divided by those of (2).

According to Figure 1.15, fine-grained memory enabling alone can achieve up to a 12-fold reduction in power consumption.

Figure 1.15 Power reduction with fine-grained memory enabling

Second, we evaluated the impact of the inherent caching and local clocking schemes. We enabled the fine-grained memory enabling scheme, and then ran the simulation under three conditions: (1) without either inherent caching or local clocking (denoted "Baseline"); (2) with inherent caching but without local clocking (denoted "Cache only"); and (3) with both schemes (denoted "Cache + LC"). The results are shown in Figure 1.16, where the results without either scheme are taken as the baseline. Without local clocking, the power reduction from caching was marginal because of the low cache hit rate (e.g., 1.65% for equinix-chicago-b). Local clocking improved the cache hit rate (e.g., the hit rate for equinix-chicago-b increased to 45.9%), which resulted in a larger reduction in power consumption. Overall, when all three proposed schemes were enabled, the architecture achieved 6.3-, 15.2-, 8.1-, and 7.8-fold reductions in power consumption for the four traffic traces, respectively.
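The per-packet accounting behind these comparisons follows Equation (1.12): memory plus logic power, summed over all stages and packets, divided by the number of packets. The sketch below illustrates that model with made-up per-stage costs (the constants and packet sets are assumptions, not the profiled FPGA values).

```python
# Illustrative evaluation of the average-power model of Equation (1.12).
# Per-access power numbers below are invented for the example; the point is
# that memory power dominates and is only paid in stages a packet reads.

H = 4                      # pipeline depth (illustrative)
P_MEM, P_LOGIC = 5.0, 1.0  # assumed per-access power (mW) for memory / logic

def avg_power(packets):
    """packets: list of per-packet sets of stage indices whose SRAM is read."""
    total = 0.0
    for accessed in packets:
        for i in range(1, H + 1):
            # Logic in every stage handles the packet; memory power is paid
            # only in stages whose SRAM the packet actually reads (fine-
            # grained memory enabling zeroes it elsewhere).
            total += (P_MEM if i in accessed else 0.0) + P_LOGIC
    return total / len(packets)   # divide by the number of packets, N(p)

baseline = [set(range(1, H + 1))] * 2   # every packet reads every stage
gated    = [{1, 2}, {1}]                # packets read only the stages needed
print(avg_power(baseline), avg_power(gated))  # -> 24.0 11.5
```

Zeroing the memory term for skipped stages is exactly where the 12-fold reduction of Figure 1.15 comes from: the logic term is small, so average power tracks how many SRAMs each packet actually touches.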

Figure 1.16 Power reduction with inherent caching and local clocking

1.5 RELATED WORK

Reducing the power consumption of network routers has been a topic of significant interest [14, 7, 38]. Most of the existing work focuses on system- and network-level optimizations. Chabarek et al. [7] enumerate the power demands of two widely used Cisco routers, and further use mixed integer optimization techniques to determine the optimal configuration at each router in their sample network for a given traffic matrix. Nedevschi et al. [38] assume that the underlying hardware in network equipment supports sleeping and dynamic voltage and frequency scaling, and propose shaping the traffic into small bursts at edge routers to facilitate sleeping and rate adaptation.

Power-efficient IP lookup engines have been studied from various angles. However, to the best of our knowledge, little work has been done on pipelined SRAM-based IP lookup engines. Some TCAM-based solutions [52, 54] propose schemes that partition a routing table into several blocks and perform IP lookup on only one of the blocks; similar ideas can be applied to SRAM-based multi-pipeline architectures [21]. Such partitioning-based solutions for power-efficient SRAM-based IP lookup engines consider neither the underlying data structure nor the traffic characteristics, and are orthogonal to the solutions proposed in this chapter. Kaxiras et al. [27] propose an SRAM-based approach called IPStash for power-efficient IP lookup. IPStash replaces the full associativity of TCAMs with set-associative SRAMs to reduce power consumption. However, the set associativity depends on the routing table size and thus may not be scalable: for large routing tables, the set associativity remains large, which results in a low clock rate and high power consumption.

Traffic rate variation has been exploited in several recent papers to reduce power consumption in multi-core processor based IP lookup engines. In [35], clock gating is used to turn off the clock of unneeded processing engines of multi-core network processors to save dynamic power under low traffic workloads. In [30], a more aggressive approach that turns off these processing engines entirely is used to reduce both dynamic and static power consumption. Dynamic frequency scaling and dynamic voltage scaling are used in [28] and [37], respectively, to reduce the power consumption of the processing engines. However, those schemes still consume considerable power in the worst case, when the traffic rate is consistently high. Some of them require large buffers to store the input packets so that the traffic rate can be determined or predicted, but large packet buffers themselves incur high power consumption. Moreover, these schemes do not account for the latency of state transitions, which can result in packet loss under bursty traffic.

1.6 SUMMARY

Power (energy) consumption has emerged as a new challenge in the design of packet forwarding engines for next-generation Internet infrastructure. Though TCAMs are the de facto solution for high-speed packet forwarding, they suffer from high power consumption.
We propose mapping state-of-the-art algorithmic solutions onto parallel architectures that are based on low-power memory such as SRAM.

We exploited data structure optimization to reduce the power consumption. We formulated the problems by revisiting the conventional time-space trade-off in multi-bit tries. To minimize the worst-case power consumption for a given architecture, a dynamic programming framework was developed to determine the optimal strides for constructing tree bitmap coded multi-bit tries. Simulation using real-life backbone routing tables showed that careful design of the data structure, with awareness of the underlying architecture, could achieve a dramatic reduction in power consumption. Different architectures could result in different optimal data structures with respect to power efficiency.

We also proposed several novel architecture-specific techniques to reduce the dynamic power dissipation in SRAM-based pipelined IP lookup engines. First, the pipeline was built as an inherent cache that effectively exploited the traffic locality with minimal overhead. Second, a local clocking scheme was proposed to exploit the traffic rate variation and to improve the caching performance. Third, a fine-grained memory enabling scheme was used to eliminate unnecessary memory accesses for the input packets. Simulation using real-life traffic traces showed that our solution achieved up to a fifteen-fold reduction in dynamic power dissipation.

REFERENCES

1. Mohammad J. Akhbarizadeh, Mehrdad Nourani, Deepak S. Vijayasarathi, and T. Balsara. A non-redundant ternary CAM circuit for network search engines. IEEE Trans. VLSI Syst., 14(3).
2. Gary Anthes. Data Centers Get A Makeover. November.
3. Florin Baboescu, Dean M. Tullsen, Grigore Rosu, and Sumeet Singh. A tree based router search engine architecture with single port memories. In Proc. ISCA.

4. Anindya Basu and Girija Narlikar. Fast incremental updates for pipelined forwarding engines. In Proc. INFOCOM, pages 64-74.
5. CACTI.
6. Lorenzo De Carli, Yi Pan, Amit Kumar, Cristian Estan, and Karthikeyan Sankaralingam. Flexible lookup modules for rapid deployment of new protocols in high-speed routers. In Proc. SIGCOMM.
7. Joseph Chabarek, Joel Sommers, Paul Barford, Cristian Estan, David Tsiang, and Stephen Wright. Power awareness in network design and routing. In Proc. INFOCOM.
8. Cisco ASR 1000 Series Aggregation Services Routers.
9. Cisco CRS-1 Carrier Routing System.
10. Colby Walsworth, Emile Aben, kc claffy, and Dan Andersen. The CAIDA anonymized 2009 Internet traces dataset.
11. Cypress Sync SRAMs.
12. Minh Q. Do, Mindaugas Drazdziulis, Per Larsson-Edefors, and Lars Bengtsson. Parameterizable architecture-level SRAM power model using circuit-simulation backend for leakage calibration. In Proc. ISQED.
13. Will Eatherton, George Varghese, and Zubin Dittia. Tree bitmap: hardware/software IP lookups with incremental updates. SIGCOMM Comput. Commun. Rev., 34(2):97-122.
14. Maruti Gupta and Suresh Singh. Greening of the Internet. In Proc. SIGCOMM, pages 19-26.
15. Pankaj Gupta, Steven Lin, and Nick McKeown. Routing lookups in hardware at memory access speeds. In Proc. INFOCOM.
16. Pankaj Gupta and Nick McKeown. Algorithms for packet classification. IEEE Network, 15(2):24-32.
17. Jahangir Hasan and T. N. Vijaykumar. Dynamic pipelining: making IP-lookup truly scalable. In Proc. SIGCOMM, 2005.

18. IDT Network Search Engines.
19. Weirong Jiang and Viktor K. Prasanna. A memory-balanced linear pipeline architecture for trie-based IP lookup. In Proc. Hot Interconnects (HotI '07), pages 83-90.
20. Weirong Jiang and Viktor K. Prasanna. Parallel IP lookup using multiple SRAM-based pipelines. In Proc. IPDPS.
21. Weirong Jiang and Viktor K. Prasanna. Towards green routers: Depth-bounded multi-pipeline architecture for power-efficient IP lookup. In Proc. IPCCC.
22. Weirong Jiang and Viktor K. Prasanna. Reducing dynamic power dissipation in pipelined forwarding engines. In Proc. ICCD.
23. Weirong Jiang, Qingbo Wang, and Viktor K. Prasanna. Beyond TCAMs: An SRAM-based parallel multi-pipeline architecture for terabit IP lookup. In Proc. INFOCOM.
24. Juniper Networks SRX 5000 Services Gateways.
25. Juniper Networks T-series Routing Platforms.
26. Juniper Networks T1600 Core Router.
27. Stefanos Kaxiras and Georgios Keramidas. IPStash: a set-associative memory approach for efficient IP-lookup. In Proc. INFOCOM.
28. Alan Kennedy, Xiaojun Wang, Zhen Liu, and Bin Liu. Low power architecture for high speed packet classification. In Proc. ANCS.
29. Kun Suk Kim and Sartaj Sahni. Efficient construction of pipelined multibit-trie router-tables. IEEE Trans. Comput., 56(1):32-43.
30. Ravi Kokku, Upendra B. Shevade, Nishit S. Shah, Mike Dahlin, and Harrick M. Vin. Energy-Efficient Packet Processing.
31. Sailesh Kumar, Michela Becchi, Patrick Crowley, and Jonathan Turner. CAMP: fast and efficient IP lookup architecture. In Proc. ANCS, pages 51-60, 2006.


More information

Performance Evaluation and Improvement of Algorithmic Approaches for Packet Classification

Performance Evaluation and Improvement of Algorithmic Approaches for Packet Classification Performance Evaluation and Improvement of Algorithmic Approaches for Packet Classification Yaxuan Qi, Jun Li Research Institute of Information Technology (RIIT) Tsinghua University, Beijing, China, 100084

More information

AN EFFICIENT HYBRID ALGORITHM FOR MULTIDIMENSIONAL PACKET CLASSIFICATION

AN EFFICIENT HYBRID ALGORITHM FOR MULTIDIMENSIONAL PACKET CLASSIFICATION AN EFFICIENT HYBRID ALGORITHM FOR MULTIDIMENSIONAL PACKET CLASSIFICATION Yaxuan Qi 1 and Jun Li 1,2 1 Research Institute of Information Technology (RIIT), Tsinghua University, Beijing, China, 100084 2

More information

Scalable Packet Classification on FPGA

Scalable Packet Classification on FPGA 1668 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 9, SEPTEMBER 2012 Scalable Packet Classification on FPGA Weirong Jiang, Member, IEEE, and Viktor K. Prasanna, Fellow,

More information

High Performance Architecture for Flow-Table Lookup in SDN on FPGA

High Performance Architecture for Flow-Table Lookup in SDN on FPGA High Performance Architecture for Flow-Table Lookup in SDN on FPGA Rashid Hatami a, Hossein Bahramgiri a and Ahmad Khonsari b a Maleke Ashtar University of Technology, Tehran, Iran b Tehran University

More information

Forwarding and Routers : Computer Networking. Original IP Route Lookup. Outline

Forwarding and Routers : Computer Networking. Original IP Route Lookup. Outline Forwarding and Routers 15-744: Computer Networking L-9 Router Algorithms IP lookup Longest prefix matching Classification Flow monitoring Readings [EVF3] Bitmap Algorithms for Active Flows on High Speed

More information

CS 268: Route Lookup and Packet Classification

CS 268: Route Lookup and Packet Classification Overview CS 268: Route Lookup and Packet Classification Packet Lookup Packet Classification Ion Stoica March 3, 24 istoica@cs.berkeley.edu 2 Lookup Problem Identify the output interface to forward an incoming

More information

Multi-dimensional Packet Classification on FPGA: 100 Gbps and Beyond

Multi-dimensional Packet Classification on FPGA: 100 Gbps and Beyond Multi-dimensional Packet Classification on FPGA: 00 Gbps and Beyond Yaxuan Qi, Jeffrey Fong 2, Weirong Jiang 3, Bo Xu 4, Jun Li 5, Viktor Prasanna 6, 2, 4, 5 Research Institute of Information Technology

More information

Towards High-performance Flow-level level Packet Processing on Multi-core Network Processors

Towards High-performance Flow-level level Packet Processing on Multi-core Network Processors Towards High-performance Flow-level level Packet Processing on Multi-core Network Processors Yaxuan Qi (presenter), Bo Xu, Fei He, Baohua Yang, Jianming Yu and Jun Li ANCS 2007, Orlando, USA Outline Introduction

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

Problem Statement. Algorithm MinDPQ (contd.) Algorithm MinDPQ. Summary of Algorithm MinDPQ. Algorithm MinDPQ: Experimental Results.

Problem Statement. Algorithm MinDPQ (contd.) Algorithm MinDPQ. Summary of Algorithm MinDPQ. Algorithm MinDPQ: Experimental Results. Algorithms for Routing Lookups and Packet Classification October 3, 2000 High Level Outline Part I. Routing Lookups - Two lookup algorithms Part II. Packet Classification - One classification algorithm

More information

Tree-Based Minimization of TCAM Entries for Packet Classification

Tree-Based Minimization of TCAM Entries for Packet Classification Tree-Based Minimization of TCAM Entries for Packet Classification YanSunandMinSikKim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington 99164-2752, U.S.A.

More information

Content Addressable Memory with Efficient Power Consumption and Throughput

Content Addressable Memory with Efficient Power Consumption and Throughput International journal of Emerging Trends in Science and Technology Content Addressable Memory with Efficient Power Consumption and Throughput Authors Karthik.M 1, R.R.Jegan 2, Dr.G.K.D.Prasanna Venkatesan

More information

Multiroot: Towards Memory-Ef cient Router Virtualization

Multiroot: Towards Memory-Ef cient Router Virtualization Multiroot: Towards Memory-Ef cient Router Virtualization Thilan Ganegedara, Weirong Jiang, Viktor Prasanna University of Southern California 3740 McClintock Ave., Los Angeles, CA 90089 Email: {ganegeda,

More information

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck

More information

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA Da Tong, Viktor Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Email: {datong, prasanna}@usc.edu

More information

CSE 123A Computer Networks

CSE 123A Computer Networks CSE 123A Computer Networks Winter 2005 Lecture 8: IP Router Design Many portions courtesy Nick McKeown Overview Router basics Interconnection architecture Input Queuing Output Queuing Virtual output Queuing

More information

Recursively Partitioned Static IP Router-Tables *

Recursively Partitioned Static IP Router-Tables * Recursively Partitioned Static IP Router-Tables * Wencheng Lu Sartaj Sahni {wlu,sahni}@cise.ufl.edu Department of Computer and Information Science and Engineering University of Florida, Gainesville, FL

More information

Binary Search Schemes for Fast IP Lookups

Binary Search Schemes for Fast IP Lookups 1 Schemes for Fast IP Lookups Pronita Mehrotra, Paul D. Franzon Abstract IP route look up is the most time consuming operation of a router. Route lookup is becoming a very challenging problem due to the

More information

Scalable Lookup Algorithms for IPv6

Scalable Lookup Algorithms for IPv6 Scalable Lookup Algorithms for IPv6 Aleksandra Smiljanić a*, Zoran Čiča a a School of Electrical Engineering, Belgrade University, Bul. Kralja Aleksandra 73, 11120 Belgrade, Serbia ABSTRACT IPv4 addresses

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department Flexible Lookup Modules for Rapid Deployment of New Protocols in High-Speed Routers Lorenzo De Carli Yi Pan Amit Kumar Cristian Estan Karthikeyan Sankaralingam Technical Report

More information

Novel Hardware Architecture for Fast Address Lookups

Novel Hardware Architecture for Fast Address Lookups Novel Hardware Architecture for Fast Address Lookups Pronita Mehrotra Paul D. Franzon Department of Electrical and Computer Engineering North Carolina State University {pmehrot,paulf}@eos.ncsu.edu This

More information

Homework 1 Solutions:

Homework 1 Solutions: Homework 1 Solutions: If we expand the square in the statistic, we get three terms that have to be summed for each i: (ExpectedFrequency[i]), (2ObservedFrequency[i]) and (ObservedFrequency[i])2 / Expected

More information

Journal of Network and Computer Applications

Journal of Network and Computer Applications Journal of Network and Computer Applications () 4 Contents lists available at ScienceDirect Journal of Network and Computer Applications journal homepage: www.elsevier.com/locate/jnca Memory-efficient

More information

Introduction. Router Architectures. Introduction. Introduction. Recent advances in routing architecture including

Introduction. Router Architectures. Introduction. Introduction. Recent advances in routing architecture including Introduction Router Architectures Recent advances in routing architecture including specialized hardware switching fabrics efficient and faster lookup algorithms have created routers that are capable of

More information

Updating Designed for Fast IP Lookup

Updating Designed for Fast IP Lookup Updating Designed for Fast IP Lookup Natasa Maksic, Zoran Chicha and Aleksandra Smiljanić School of Electrical Engineering Belgrade University, Serbia Email:maksicn@etf.rs, cicasyl@etf.rs, aleksandra@etf.rs

More information

Design of a High Speed FPGA-Based Classifier for Efficient Packet Classification

Design of a High Speed FPGA-Based Classifier for Efficient Packet Classification Design of a High Speed FPGA-Based Classifier for Efficient Packet Classification V.S.Pallavi 1, Dr.D.Rukmani Devi 2 PG Scholar 1, Department of ECE, RMK Engineering College, Chennai, Tamil Nadu, India

More information

Parallel-Search Trie-based Scheme for Fast IP Lookup

Parallel-Search Trie-based Scheme for Fast IP Lookup Parallel-Search Trie-based Scheme for Fast IP Lookup Roberto Rojas-Cessa, Lakshmi Ramesh, Ziqian Dong, Lin Cai, and Nirwan Ansari Department of Electrical and Computer Engineering, New Jersey Institute

More information

Introduction. Introduction. Router Architectures. Introduction. Recent advances in routing architecture including

Introduction. Introduction. Router Architectures. Introduction. Recent advances in routing architecture including Router Architectures By the end of this lecture, you should be able to. Explain the different generations of router architectures Describe the route lookup process Explain the operation of PATRICIA algorithm

More information

EECS 122: Introduction to Computer Networks Switch and Router Architectures. Today s Lecture

EECS 122: Introduction to Computer Networks Switch and Router Architectures. Today s Lecture EECS : Introduction to Computer Networks Switch and Router Architectures Computer Science Division Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley,

More information

High-Performance Network Data-Packet Classification Using Embedded Content-Addressable Memory

High-Performance Network Data-Packet Classification Using Embedded Content-Addressable Memory High-Performance Network Data-Packet Classification Using Embedded Content-Addressable Memory Embedding a TCAM block along with the rest of the system in a single device should overcome the disadvantages

More information

AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING. 1. Introduction. 2. Associative Cache Scheme

AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING. 1. Introduction. 2. Associative Cache Scheme AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING James J. Rooney 1 José G. Delgado-Frias 2 Douglas H. Summerville 1 1 Dept. of Electrical and Computer Engineering. 2 School of Electrical Engr. and Computer

More information

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore

More information

FPGA Based Packet Classification Using Multi-Pipeline Architecture

FPGA Based Packet Classification Using Multi-Pipeline Architecture International Journal of Wireless Communications and Mobile Computing 2015; 3(3): 27-32 Published online May 8, 2015 (http://www.sciencepublishinggroup.com/j/wcmc) doi: 10.11648/j.wcmc.20150303.11 ISSN:

More information

Load Balancing with Minimal Flow Remapping for Network Processors

Load Balancing with Minimal Flow Remapping for Network Processors Load Balancing with Minimal Flow Remapping for Network Processors Imad Khazali and Anjali Agarwal Electrical and Computer Engineering Department Concordia University Montreal, Quebec, Canada Email: {ikhazali,

More information

Scaling routers: Where do we go from here?

Scaling routers: Where do we go from here? Scaling routers: Where do we go from here? HPSR, Kobe, Japan May 28 th, 2002 Nick McKeown Professor of Electrical Engineering and Computer Science, Stanford University nickm@stanford.edu www.stanford.edu/~nickm

More information

Generic Architecture. EECS 122: Introduction to Computer Networks Switch and Router Architectures. Shared Memory (1 st Generation) Today s Lecture

Generic Architecture. EECS 122: Introduction to Computer Networks Switch and Router Architectures. Shared Memory (1 st Generation) Today s Lecture Generic Architecture EECS : Introduction to Computer Networks Switch and Router Architectures Computer Science Division Department of Electrical Engineering and Computer Sciences University of California,

More information

Scalable Ternary Content Addressable Memory Implementation Using FPGAs

Scalable Ternary Content Addressable Memory Implementation Using FPGAs Scalable Ternary Content Addressable Memory Implementation Using FPGAs Weirong Jiang Xilinx Research Labs San Jose, CA, USA weirongj@acm.org ABSTRACT Ternary Content Addressable Memory (TCAM) is widely

More information

Scalable Packet Classification on FPGA

Scalable Packet Classification on FPGA Scalable Packet Classification on FPGA 1 Deepak K. Thakkar, 2 Dr. B. S. Agarkar 1 Student, 2 Professor 1 Electronics and Telecommunication Engineering, 1 Sanjivani college of Engineering, Kopargaon, India.

More information

IP LOOK-UP WITH TIME OR MEMORY GUARANTEE AND LOW UPDATE TIME 1

IP LOOK-UP WITH TIME OR MEMORY GUARANTEE AND LOW UPDATE TIME 1 2005 IEEE International Symposium on Signal Processing and Information Technology IP LOOK-UP WITH TIME OR MEMORY GUARANTEE AND LOW UPDATE TIME 1 G.T. Kousiouris and D.N. Serpanos Dept. of Electrical and

More information

PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES

PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES Greg Hankins APRICOT 2012 2012 Brocade Communications Systems, Inc. 2012/02/28 Lookup Capacity and Forwarding

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Efficient Packet Classification for Network Intrusion Detection using FPGA

Efficient Packet Classification for Network Intrusion Detection using FPGA Efficient Packet Classification for Network Intrusion Detection using FPGA ABSTRACT Haoyu Song Department of CSE Washington University St. Louis, USA hs@arl.wustl.edu FPGA technology has become widely

More information

Efficient Construction Of Variable-Stride Multibit Tries For IP Lookup

Efficient Construction Of Variable-Stride Multibit Tries For IP Lookup " Efficient Construction Of Variable-Stride Multibit Tries For IP Lookup Sartaj Sahni & Kun Suk Kim sahni, kskim @ciseufledu Department of Computer and Information Science and Engineering University of

More information

Scalable IP Routing Lookup in Next Generation Network

Scalable IP Routing Lookup in Next Generation Network Scalable IP Routing Lookup in Next Generation Network Chia-Tai Chan 1, Pi-Chung Wang 1,Shuo-ChengHu 2, Chung-Liang Lee 1, and Rong-Chang Chen 3 1 Telecommunication Laboratories, Chunghwa Telecom Co., Ltd.

More information

PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS

PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS THE UNIVERSITY OF NAIROBI DEPARTMENT OF ELECTRICAL AND INFORMATION ENGINEERING FINAL YEAR PROJECT. PROJECT NO. 60 PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS OMARI JAPHETH N. F17/2157/2004 SUPERVISOR:

More information

Fast IP Routing Lookup with Configurable Processor and Compressed Routing Table

Fast IP Routing Lookup with Configurable Processor and Compressed Routing Table Fast IP Routing Lookup with Configurable Processor and Compressed Routing Table H. Michael Ji, and Ranga Srinivasan Tensilica, Inc. 3255-6 Scott Blvd Santa Clara, CA 95054 Abstract--In this paper we examine

More information

Midterm Review. Congestion Mgt, CIDR addresses,tcp processing, TCP close. Routing. hierarchical networks. Routing with OSPF, IS-IS, BGP-4

Midterm Review. Congestion Mgt, CIDR addresses,tcp processing, TCP close. Routing. hierarchical networks. Routing with OSPF, IS-IS, BGP-4 Midterm Review Week 1 Congestion Mgt, CIDR addresses,tcp processing, TCP close Week 2 Routing. hierarchical networks Week 3 Routing with OSPF, IS-IS, BGP-4 Week 4 IBGP, Prefix lookup, Tries, Non-stop routers,

More information

Routing Lookup Algorithm for IPv6 using Hash Tables

Routing Lookup Algorithm for IPv6 using Hash Tables Routing Lookup Algorithm for IPv6 using Hash Tables Peter Korppoey, John Smith, Department of Electronics Engineering, New Mexico State University-Main Campus Abstract: After analyzing of existing routing

More information

Dynamic Pipelining: Making IP-Lookup Truly Scalable

Dynamic Pipelining: Making IP-Lookup Truly Scalable Dynamic Pipelining: Making IP-Lookup Truly Scalable Jahangir Hasan T. N. Vijaykumar {hasanj, vijay} @ecn.purdue.edu School of Electrical and Computer Engineering, Purdue University Abstract A truly scalable

More information

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER International Journal of Computer Engineering and Applications, Volume VI, Issue II, May 14 www.ijcea.com ISSN 2321 3469 DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER Kiran K C 1, Sunil T D

More information

Multiway Range Trees: Scalable IP Lookup with Fast Updates

Multiway Range Trees: Scalable IP Lookup with Fast Updates Multiway Range Trees: Scalable IP Lookup with Fast Updates Subhash Suri George Varghese Priyank Ramesh Warkhede Department of Computer Science Washington University St. Louis, MO 63130. Abstract In this

More information

Episode 5. Scheduling and Traffic Management

Episode 5. Scheduling and Traffic Management Episode 5. Scheduling and Traffic Management Part 3 Baochun Li Department of Electrical and Computer Engineering University of Toronto Outline What is scheduling? Why do we need it? Requirements of a scheduling

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction In a packet-switched network, packets are buffered when they cannot be processed or transmitted at the rate they arrive. There are three main reasons that a router, with generic

More information

Scaling Internet Routers Using Optics Producing a 100TB/s Router. Ashley Green and Brad Rosen February 16, 2004

Scaling Internet Routers Using Optics Producing a 100TB/s Router. Ashley Green and Brad Rosen February 16, 2004 Scaling Internet Routers Using Optics Producing a 100TB/s Router Ashley Green and Brad Rosen February 16, 2004 Presentation Outline Motivation Avi s Black Box Black Box: Load Balance Switch Conclusion

More information