Ting-Sheng Chen, Ding-Yuan Lee, Tsung-Te Liu, and An-Yeu Wu, Fellow, IEEE

Size: px

Start display at page:

Download "Ting-Sheng Chen, Ding-Yuan Lee, Tsung-Te Liu, and An-Yeu Wu, Fellow, IEEE"

Brice Golden
6 years ago
Views:

1 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 63, NO. 10, OCTOBER Dynamic Reconfigurable Ternary Content Addressable Memory for OpenFlow-Compliant Low-Power Packet Processing Ting-Sheng Chen, Ding-Yuan Lee, Tsung-Te Liu, and An-Yeu Wu, Fellow, IEEE Abstract OpenFlow, the main protocol for software-defined networking (SDN) advocated by Google, requires multiple flow tables with long and variable rule lengths. For fast packet forwarding, using ternary content-addressable memory (TCAM) as implementation of lookup-table has displayed superior performance in executing the conventional packet classification algorithms. However, applying traditional TCAM designs to an OpenFlow switch leads to problems of insufficient word length, inflexible table configuration, and high-power consumptions. This paper presents a low-power dynamic reconfigurable TCAM design for OpenFlowcompliant packet processing. It can support long and variable word lengths for different flow tables, with less than 5% of area overhead. To reduce dynamic power consumption without sacrificing throughput, the proposed hybrid-pipelined structure based on NAND-NOR CAM and pipelined-tcam, can achieve lower energy-delay products than traditional TCAM. Moreover, the proposed self-power gating technique controlled by the mask data optimizes both NAND and NOR core cells, which lowers power consumption without additional control and routing overhead. Finally, the low-power reconfigurable TCAM is implemented in TSMC 40 nm CMOS technology. The implementation results show that the energy metric of the proposed reconfigurable TCAM is only fj/bit/search, at a 416 MHz operating frequency. Thus, the proposed TCAM design is very suitable for the emerging high performance packet processing in future SDN applications that need long and variable word lengths. Index Terms High-throughput, low-power, OpenFlow, packet classification, reconfigurable VLSI design, software-defined networking (SDN), ternary content -addressable memory (TCAM). I. INTRODUCTION WITH growing global data traffic, the backbone network of cloud services is facing more challenging and severe requirements. To meet the demands of the backbone network, Software-Defined Networking (SDN) was proposed in 2009 [1]. SDN has since become one of the most promising archi- Manuscript received February 21, 2016; revised June 9, 2016; accepted June 22, Date of publication September 13, 2016; date of current version September 27, This work was supported in part by the Ministry of Science and Technology, ROC, under Grant E , E , and E MY2. This paper was recommended by Associate Editor R. Azarderakhsh. The authors are with the Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, Taipei 106, Taiwan ( tim@access.ee.ntu.edu.tw; dan@access.ee.ntu.edu.tw; ttliu@ ntu.edu.tw; andywu@ntu.edu.tw). Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TCSI Fig. 1. A simplified block diagram of OpenFlow switch. tectures for managing large-scale complex networks. Currently, OpenFlow, which is advocated by Google [2], [3], is the most popular SDN protocol/standard and has been standardized [4]. As shown in Fig. 1, to achieve programmability, an OpenFlow switch requests multiple flow tables with various match-field. Rule lengths in each table ranging from 32 to 568 bits. Recently, Ternary Content Addressable Memory (TCAM) has been applied to packet processing to forward packets rapidly [5]. Owing to the parallel architecture, TCAM displays superior performance in executing packet classification algorithm [6]. Although TCAM design had been intensively studied for computer architecture applications [7] [10], three major issues are yet to be addressed when using TCAM to implement a lookup table in OpenFlow-compliant switches, as shown in Fig. 1: 1) Insufficient word length: The current version of the OpenFlow switch must support at least 14 match fields, with the rule lengths in a flow table may reach 568 bits. However, the word lengths of most current TCAM designs are insufficient. Moreover, TCAM with long word lengths cause a long delay path, even resulting in the worse noise margin. 2) Inflexible table configuration: Multiple-table processing enables packet classification to be programmable and stable. Rule lengths in each table are different. However, traditional TCAM designs with fixed word lengths are not suitable for variable rule lengths. Applying traditional TCAM designs to OpenFlow switches will lead to inflexible table configuration and wastage of precious TCAM resources IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See for more information.

1662 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 63, NO. 10, OCTOBER 2016 Fig. 2. Proposed dynamic-reconfigurable TCAM for OpenFlow-compliant switches. Fig. 3.

2 1662 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 63, NO. 10, OCTOBER 2016 Fig. 2. Proposed dynamic-reconfigurable TCAM for OpenFlow-compliant switches. Fig. 3. Example of OpenFlow multiple-table processing. 3) High-power consumption: OpenFlow switches need greater number of flow tables than traditional network switches. TCAM is very power-hungry, consuming the largest portion (as much as 25%) of total power in a highend network switch, as discussed in [28]. Multiple-table processing causes more substantial static/dynamic power consumptions. In this paper, we aim to develop a dynamic-reconfigurable lowpower TCAM for OpenFlow-compliant switches. As depicted in Fig. 2, the main contributions of this paper are as follows: 1) To reduce static power consumption, we optimize both NAND and NOR core cells. Without additional control and routing overheads, the proposed self-power gating technique controlled by mask cells can reduce static power by 42% and dynamic power consumption by 15% at the cell level. 2) To lower the dynamic power consumption without sacrificing the throughput rate, we propose a hybrid-pipelined structure by combining both NAND-NOR TCAM and pipeline TCAM. Furthermore, we analyze the appropriate size for the NAND-type TCAM, further minimizing the energy-delay product by 45% at the structural level. 3) We propose a dynamic reconfigurable TCAM to handle long and variable word lengths. The proposed Filter-Based Multi-Mode Cascaded (FBMMC) architecture can flexibly support word lengths ranging from 144 to 576 bits. Moreover, the area and timing overheads of this reconfigurable architecture are relatively low, achieving less than 5% of area overhead. To validate our design, we implement the dynamic reconfigurable low-power TCAM in TSMC 40-nm CMOS technology. The simulation results show that the energy metric is fj/ Bit/Search at a 416 MHz operating frequency. The rest of this paper is organized as follows. Section II briefly introduces the principles of the OpenFlow switch and traditional TCAM architecture. Section III presents the self-power gating technique and hybrid-pipelined structure for low-power TCAM operation. We illustrate the proposed dynamic reconfigurable TCAM in Section IV. Section V compares this work to other related works. Section V concludes this paper. II. ARCHITECTURE OF TCAM MACRO IN NETWORK SWITCHES A. Operations of OpenFlow Switch Fig. 1 shows a simplified block diagram of an OpenFlow switch. It utilizes the OpenFlow protocol and channel to communicate with OpenFlow controllers. It can adjust the routing policy of switches, depending on the network topology and traffic status, by adding, modifying, or removing the rules of flow tables. There are multiple flow tables in an OpenFlow switch. Multiple-table processing enables packet classification to be programmable and stable. As shown in Fig. 3, when receiving a packet, the packet header is sent to flow-table_0, which is typically an initial classifier. It decides which table can further process this packet header, and then forwards it to that flow table. The Current Openflow switch (V1.5) support at least 14 fields, as shown in Table I. Each table can focus on the selected tuples as the match field, with each field having mask ability. The lengths of the forwarding rules in each table are different. The rule lengths in each flow table range from 32 bits (e.g., IPv4_dst.) to 568 bits (14 match fields). In general, a flow table with a longer rule length gives rough classification, which means that it has fewer entries [11]. On the other hand, a flow table with less tuples (e.g., a firewall) has more entries [12]. Compared with the conventional switch, an OpenFlow switch can process multiple layers (L1 to L4) of a packet header. However, the large-sized and multiple tables cause considerable power consumption. Thus, flow tables for OpenFlow switch with reconfigurable, high-performance, and low-power design is essential for the backbone network in the future. B. Traditional TCAM Architecture Using TCAM as implementation of lookup-table displays superior performance in executing conventional packet classification algorithm. Owing to the parallel architecture, TCAM can finish searching and comparison of a packet header in one or a few clock cycles. Fig. 4 shows the block diagram of CAM. A search word is broadcasted onto SearchLines (SL) of the table of stored words. The size of the words is L bits and the number of entries is typically from a few hundred to a few thousand. Each word has a MatchLine (ML) that indicates whether the

Fig. 4. Block diagram of content addressable memory (CAM).

3 CHEN et al.: DYNAMIC RECONFIGURABLE TCAM FOR OPENFLOW-COMPLIANT LOW-POWER PACKET PROCESSING 1663 TABLE I EXAMPLE OF OPENFLOW 14-FIELD PACKET CLASSIFICATION RULE SET TABLE II SUMMARY OF NAND AND NOR-TYPE TCAM Fig. 4. Block diagram of content addressable memory (CAM). Unlike the NAND-type TCAM that pulls ML down when all match occurs, the NOR-type TCAM discharges the ML to the ground in case of any miss, as shown in Fig. 5(b). The mask bit can turn off the pull-down transistor of the NOR cell. Table II summarizes the differences between NAND- and NOR-type TCAMs. Because of the long serial pass transistor delay and threshold drop, the NAND-type TCAM requires more evaluation time than the NOR-type TCAM at the same word length. In contrast, the NOR-type TCAM can provide excellent performance owing to its short critical path delay. Dynamic power consumption comprises ML and SL power consumption. SL power consumption is related to the word length and input pattern. As for ML power consumption, owing to the low probability of rule matching and load capacitance, the switched capacitance derived from multiplying switching activity and load capacitance is also low, resulting in minimal dynamic power consumption. On the other hand, because of high switching activity and load capacitance, the NOR-type TCAM consumes more power consumption than the NANDtype TCAM. Fig. 5. Conventional L-bit TCAM: (a) NAND-type TCAM; (b) NOR-type TCAM with a matched case; (c) NOR-type TCAM with 1-bit mismatched case. search data and the stored word are the same. The priority encoder outputs the matched location. The CAM with mask ability is called ternary CAM (TCAM). The masked bit of the TCAM always matches the search bit, regardless of the content of the search bit. There are two types of TCAM: NAND- and NOR-type TCAM.Fig.5(a)showstheNAND-type TCAM. When MLPre is logic 1, ML_NAND is pre-charged to high voltage by PMOS P 2. Search data are sent to the XNOR cell, and the NMOS N1 is turned off to avoid short current to the ground. When MLPRe is logic 0, PMOS P 2 stops charging the ML. If the input data match the stored rule, then the ML will be discharged to the ground. The mask bit can always turn on the pass transistor of the NAND cell, regardless of the match result of the XNOR cell. C. Discontinuous Distribution of Mask Data Because the OpenFlow switch gathers multiple match fields in a flow table, the continuous feature of mask data rarely appears in many-field classification. Therefore, traditional TCAM designs for single-field classification [14], [16], [23] (e.g., IPv6 destination) that utilize the continuous feature of mask data are unsuitable for OpenFlow switch or many-field packet classification with discontinuous distribution of mask data. Priority encoder is not suitable for multiple-field classification, because the priority of a rule is no longer consistent with the proportion of mask data itself. Furthermore, the use of priority encoder results in increased energy consumption of pattern updates and search operations. Hence, for multiple-field classification, the concept of Priority Decision in Memory [13] and the priority field shown in Table I can be utilized to decide the match result. D. Challenge of Long-Word Lengths TCAM Recently, a TCAM with butterfly matchline (Butterfly- TCAM) was presented by [14] with low energy consumption

4 1664 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 63, NO. 10, OCTOBER 2016 solutions take advantage of the continuous feature of mask data, which can be applied only to single-field packet classification, but not to many-field packet classification. Fig. 7 shows the proposed self-power gating technique (S-PG) for both the NAND- andnor-type TCAM. The match results of NAND and NOR-type TCAM, Match NAND and Match NOR is given by: Fig. 6. Power gating technique utilizes a continuous feature of mask data, with additional control and routing overhead proposed in [14]. design. A TCAM with Low-swing MatchLine Sense amplifier (LVMLSS-TCAM) was developed by [15]. However, both Butterfly and LVMLSS TCAMs did not consider long word lengths and reconfigurable architecture. TCAM with long word lengths will result in larger ML capacitance, which causes greater dynamic power consumption than traditional TCAM. Furthermore, TCAM with longer word lengths also affects search operation. Shown in Fig. 5(b) and (c), the total ML currents of an L-bit NOR-type TCAM for the cases of match and 1-bit mismatch can be written as follows: I ML_match = L I OFF (1) I ML_1bit_mismatch =(L 1) I OFF + I ON (2) where I ON and I OFF are the currents of the matched and mismatched cells, respectively. Equations (1) and (2) show that if the ratio I ON /I OFF decreases or the word length L increases, then distinguishing a matched case from a 1-bit mismatch case would be difficult. As the result, when designing TCAM with long word length, we should reduce the leakage current and avoid lengthening the word length directly. III. LOW-POWER TCAM DESIGN FOR OPENFLOW To reduce power consumption, we propose a self-power gating technique at the cell level and a hybrid-pipelined TCAM at the structural level. These techniques are the fundamental processing elements of the proposed low-power dynamic reconfigurable TCAM for OpenFlow-compliant switch. A. Self-Power Gating for NAND and NOR Cells Most power reduction techniques in the literature focus on dynamic power reduction. However, the evolution of semiconductor process technology has resulted in leakage power playing an important role in the total power consumption. As mentioned before, the leakage component of the TCAM also affects the search operation. Therefore, it is important to lower the leakage component of TCAM circuit. To reduce static power consumption, power gating is a popular technique at the circuit level. However, the area overhead of the power gating technique due to global signal routing can be substantial. Fig. 6 shows a super-cut off power gating technique with complex control scheme was proposed in [14]. A twoside self-gating technique to reduce leakage power of the mask cell and control overhead was proposed in [16]. However, these Match NAND = ( (SL 1 D 1 + M 1 ) (SL n D n + M n ) ) (3) Match NOR = ( (SL 1 D 1 ) M 1 + +(SL n D n ) M n ) where SL i, D i,andm i are search data, stored data, and mask bit, respectively. If a mask bit is logic 1, the SL i D i in (3) and SL i D i in (4) are not important for the final match result. Owing to the mask cells work as usual, the search delay caused by this power gating technique is very small, which is consistent with the results of [14] and [16]. Therefore, S-PG technique turns off the XNOR cell in the NAND-type TCAM and the XOR cell in the NOR-type TCAM depending on the mask data of each bit. The additional PMOS and NMOS transistors help reduce the leakage current of the 6 T SRAM storage cell. Moreover, unlike the dynamic power source technique proposed in [17] that unable to turn-off the binary cell completely, the proposed technique lowers the storage data D and D to 0 V, further reducing the gate leakage of NMOS N1 and NMOS N2, and switching activities of NMOS N3, as shown in Table III. On the other hand, if the mask bit M in Fig. 7 is logic 0, then the XNOR and XOR cells work like conventional TCAM. B. Hybrid-Pipelined TCAM in OpenFlow Operations Many power reduction techniques are proposed to reduce TCAM power consumption [5], such as low-swing ML [15] and low-swing SL [18]. For dynamic power consumption, hierarchical ML is a very popular architecture to reduce high ML power consumption, such as pipelined-structure TCAM proposed in [19] and pulse NAND-NOR CAM (PNN-CAM) proposed in [20]. PNN CAM takes advantage of both NAND CAM and NOR TCAM simultaneously. The NAND-type CAM as a rule filter, consumes less dynamic power than the NORtype TCAM. On the other hand, the NOR-type TCAM provides excellent performance as the main-search component. Although the filter can lower the activation of main search components, the main drawback of the NAND ML should be well considered. The quadratic delay dependence comes from the fact that adding a NAND cell to a NAND ML adds both a series resistance and a capacitance to the ground because of the NMOS diffusion capacitance. Hence, most NAND-type TCAMs limit the bits of the NAND ML cells to 8 16 in order to limit the quadratic degradation in speed [5]. The additional delay caused by the filter result in throughput degradation. When traffic congestion occurs in the Internet, throughput degradation can be serious and may even lead to packet dropping, due to the packet buffer running out. Although the Early-Predict (4)

5 CHEN et al.: DYNAMIC RECONFIGURABLE TCAM FOR OPENFLOW-COMPLIANT LOW-POWER PACKET PROCESSING 1665 Fig. 7. Proposed self-power gating technique, without additional control and routing overhead: (a) NAND-type TCAM; (b) NOR-type TCAM. TABLE III TWO MODES OF THE SELF-POWER GATING TECHNIQUES Late-Correct (ELPC) scheme proposed in [21] can improve throughputs of hierarchical ML TCAM, it causes extra dynamic power consumption in the main-search component in the case of a miss-prediction. To improve the throughput, 1-bit pipeline registers are inserted between the NAND- and NOR-type TCAMs, as shown in Fig. 8. Owing to the discontinuous feature of the mask data in an OpenFlow switch, i.e., the mask data may appear in the rule filters, we replace the NAND-CAM with NAND-TCAM. Unlike the EPLC scheme, which complete search operation in one clock cycle, the pipeline registers can divide the search operation into two stages. The activation of PMOS P 3 in Fig. 8, EN,isgivenby EN = MLPre Match (5) where Match is the match information of the previous stages. If match occurs in the first stage, Match is logic 0 and this match information is stored in the left part of the pipeline register. It will be passed to the right part of the pipeline register and PMOS P 3 will be turned on to pre-charge the NOR ML in the subsequent clock cycle. If no match occurs, then PMOS P 3 will be turned off until a match occurs again in the pre-search filter. Owing to the body effect, the threshold voltage of NMOS N2 in Fig. 8 is greater than that of NMOS N3. Match arrives slower than MLPre because of the propagation delay of the pipeline register. To avoid the glitch that results in unnecessary pre-charge power consumption shown in Fig. 9, MLPre is connected to NMOS N2, and Match is connected to NMOS N3.To ensurethe functionalcorrectness, if ML_NAND transmits missinformation, the ML_NOR will be discharge to the ground by NMOS N4 in Fig. 8 in the next clock cycle. Fig. 10 presents a ML Sense Amplifier (MLSA) with a gated clock design. If a match occurs in the first stage, then ML_NOR will be pre-charged to VDD. Meanwhile, the MLSA will be turned on and ML_SENSE will be set to logic 1. If a mismatch occurs in the main-search component, then the MLSA senses the mismatch condition and pulls ML_SENSE to the ground once ML_NOR is discharged to V tp. On the other hand, the MLSA will not be turned on if a mismatch case occurs in the first stage, thus reduce unnecessary power consumption. C. Analysis of Hybrid-Pipelined TCAM Before inserting pipeline registers, the search latency T delay and the search frequency f search of the NAND-NOR TCAM needs to satisfy the following conditions: T delay >T NAND + T NOR, f search < 1 T NAND + T NOR (6) where T NAND and T NOR are the delay time of the NANDand NOR-type TCAMs, respectively, and are dependent on the

6 1666 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 63, NO. 10, OCTOBER 2016 Fig. 8. Pipeline register with selective pre-charge logic is inserted between a 9-bit NAND-type TCAM and a 135-bit NOR-type TCAM. The precharge PMOS P3 of NOR matchline is turned on only when the filter match in the last cycle. order to achieve maximum power reduction, we analyze the expected value of power consumption P EX, given as follows: ( ) k 3 P EX = (NANDmatch(k)+NORdyn(L K)) 4 ( ( ) ) k (NAND miss (k)+nor leak (L K)) (8) 4 Fig. 9. Glitch resulting in unnecessary power consumption. Fig. 10. MLSA with gated clock design. where NAND miss (k) and NAND match (k) represent power consumption of mismatch and match cases, respectively, occurring at the K-bit NAND-type TCAM, and NOR dyn (L K) and NOR leak (L K) are the dynamic and static power consumption of mismatch and match cases, respectively, occurring at the K-bit NAND-type TCAM, and NOR dyn (L K) and NOR leak (L K) are the dynamic and static power consumptions, respectively, of the (L K) bit. In this study, we consider L = 144 for IPv6 lookup tables [22]. The value of k is scanned from 8 to 20 bits, and K =9 is considered as it minimizes P EX. The modified NAND-NOR TCAM can maintain low power consumption and high throughput rate. number of bits. After pipeline registers are inserted, the search frequency f search_pipe is given by: ( ) 1 1 f search_pipe < min, (7) T su + T cq + T NOR T NAND where T su and T cq are the setup time and the propagation delay of the pipeline registers, typically much smaller than T NAND and T NOR. Pipeline registers would almost double throughput when T NAND nearly equal to T NOR. The pipeline overheads are a few latencies, area and power caused by pipeline itself, these overheads are acceptable since only 1-bit pipeline register per ML are inserted. Except for throughput degradation, the filter with long word lengths consumes more power in the first stage. The content of the stored data is typically around 50% don t care data and 50% logic 0 and 1, and the content stored in the filter is only the left k-bit of the table. If the matching probability of each bit in the NAND-type TCAM is 0.75, then the matching probability of the filter is (3/4) k. Due to the low probability of main-search activation, the average bit of the activated power-demanding main search is reduced from (L K) to (L K)(3/4) k.in D. Layout and Simulation Results Self-Power Gating Technique: The proposed S-PG technique and hybrid pipelined TCAM is implemented by Cadence Virtuoso with TSMC 40 nm library CMOS technology at 0.9 V supply voltage. The search delay is 2.4 ns. All of the parameters, including interconnects and parasitic, are extracted from the circuit layout. The post-simulation is done by HSPICE simulation tool. Layout views of the NAND and NOR-type TCAM core cell are shown in Fig. 11(a) and (b), respectively. BL and SL are separated for lower switching capacitance. Without additional control and routing overheads, the area overhead of the S-PG technique is reasonable. We constructed a 256-word TCAM structure with 135-bit NOR-type and 9-bit NAND-type TCAMs. The static power consumption is evaluated for the cases when the mask bit is on and off, while the dynamic power consumption is evaluated for the mismatch case. The simulated dynamic power consumption is normalized to an energy metric, measured in femtojoules per bit per search. Table IV shows the simulated power results with the proposed S-PG technique. We can see a significant power reduction with the proposed S-PG technique when the mask bit is

CHEN et al.: DYNAMIC RECONFIGURABLE TCAM FOR OPENFLOW-COMPLIANT LOW-POWER PACKET PROCESSING 1667 TABLE V COMPARISONS WITH OHER RELATED WORKS: POWER GATING TECHNIQUES Fig. 11.

7 CHEN et al.: DYNAMIC RECONFIGURABLE TCAM FOR OPENFLOW-COMPLIANT LOW-POWER PACKET PROCESSING 1667 TABLE V COMPARISONS WITH OHER RELATED WORKS: POWER GATING TECHNIQUES Fig. 11. Layout: (a) NAND core cell with SPG. (b) NOR core Cell with SPG. TABLE IV POWER REDUCTION OF THE SELF-POWER GATING TECHNIQUE Fig. 12. Post-layout waveforms of TCAM with self-power gating technique. enabled. Only a slight increase (5%) in the leakage power when the mask bit is disabled. The shutdown of the XOR and XNOR cells reduces the static power consumption by nearly half. The proposed S-PG technique also lowers the dynamic power consumption because NMOS N3 is disabled. If the mask data is logic 1, then traditional TCAM without the S-PG technique would consume more dynamic power from unnecessary N3 switching, as shown in Fig. 12. Table V shows a comparison of the proposed S-PG technique and other techniques [14], [16]. Our technique not only effectively reduces dynamic and static power consumption of both the NAND- and NOR-type TCAMs, but serves well for discontinuous and discrete mask data in multiple-field packet classification and other TCAM applications. It is well suited for OpenFlow-compliant packet processing and the hybridpieplined TCAM, which will be discussed in detail below. Hybrid-Pipelined TCAM Structure Based on S-PG Cell: Fig. 13 shows the results of a performance simulation for the hybrid-pipelined TCAM. The timing stability of the TCAM circuit is related to the critical path: the ML needs to be pulled down in time when a match and a mismatch case occur in the NAND- andnor-type TCAMs, respectively. The low-power sense amplifier is enabled only when a match occurs in the filter, which accelerates the sensing time of the NOR ML. Table VI indicates the throughput improvement by the proposed hybrid-pipelined TCAM. Compared with the NAND- NOR TCAM without a pipelined structure, the hybrid-pipelined TCAM can almost double the throughput rate. Although inserting the pipelines would cause slight latency overhead, it can be ignored for the following two reasons. First, the packet classification in an SDN environment can have the shortest or the best routing path, depending on the network topology and traffic conditions. Thus, a slight additional latency in a packet classification application is acceptable. Second, the TCAM itself has displayed superior performance compared with the other packet classification algorithms. Therefore, a slight latency caused by pipeline registers can be ignored, as the overall search latency of the TCAM is still considerably low. Table VII shows the energy-delay products of even and hybrid-pipelined TCAMs. Because the 9-bit NAND-type filters are enough to filter the rules efficiently, the main-search components are seldom activated, and hence, only static power is consumed. The average energy metric is lower than that of the even-pipelined TCAM. Although hybrid-pipelined TCAM requires more evaluation time compared with the even-pipelined TCAM, it has 45% lower and near-optimal energy-delay product. Both the filter and main-search component not only consume very little dynamic power but also maintain reasonable throughput. This hybrid-pipelined TCAM as a basic processing element is well suited to the following dynamic reconfigurable TCAM. IV. FILTER-BASED MULTI-MODE CASCADED TCAM The proposed FBMMC-TCAM for OpenFlow-compliant packet processing can achieve both long and variable word lengths. Furthermore, the power consumption is relatively low, with low area and timing overhead.

1668 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 63, NO. 10, OCTOBER 2016 Fig. 13. Post-layout waveforms of the hybrid-pipelined TCAM based on S-PG cell.

8 1668 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 63, NO. 10, OCTOBER 2016 Fig. 13. Post-layout waveforms of the hybrid-pipelined TCAM based on S-PG cell. TABLE VI THROUGHPUT IMPROVEMENT OF HYBRID-PIPELINED TCAM TABLE VII ENERGY-DELAY PRODUCT: EVEN-PIPELINED V.S. HYBRID-PIPELINED Fig. 14. A TCAM with flexible search mode proposed in [24]. A. Background and Motivation As explained earlier, the lengths of the forwarding rules in each table of an OpenFlow switch are different. The rule length in each flow table range from 32 bits (e.g., IPv4_dst.) to 568 bits (14 tuples). Most traditional TCAM architectures can neither support a long rule length with low power nor reconfigure the rule length if required. Cisco was also aware of the problem of variable matching sizes and therefore proposed Depth Cascading technique [29]. This technique can realize various matching sizes such as 72, 144, and 288 bits by cascading several individual CAM in serial. However, it suffers from huge search latency. Moreover, this technique leads to various search latencies with different matching sizes, causing complex control overheads. On the other hand, the proposed FBMMC architecture takes advantage of parallel architecture to configure different word lengths, therefore realizing constant search latency and lower control overheads. A TCAM with 32-, 64-, and 128-bit word lengths that takes advantage of the continuous feature of mask data to achieve storage efficiency has been proposed [23]. However, this design is only suitable for IPv6 applications, which do not work in multiple-field classification or an application without continuous mask data. Fig. 14 presents a TCAM utilizes flexible priority encoder to achieve flexible table option that was implemented in [24]. If the rule length is 320 bits, then a match occurs at all four parts of the 80-bit TCAM, which implies that a match of 320 bits occurs. This search mechanism can extend rule lengths instead of lengthening the word length directly. However, without any rule filter, the energy metric (fj/bit/search) is ten times more than the butterfly TCAM [14]. In addition, the use of priority encoder also results in increased energy consumption of pattern updates and search operations, furthermore, priority encoder is not suitable for multiple-field classification. Therefore, we propose a FBMMC-TCAM architecture, which features low power consumption and area overhead for OpenFlow-compliant switches. The proposed dynamic reconfigurable TCAM has a flexible triple-search mode. A single search mode that stores 256 rules and 144 bits of word lengths per rule, a dual-search mode that stores 128 rules and 288 bits of word lengths per rule, and a quad-search mode that stores up to 64 rules and 576 bits of word length per rule are all achieved. B. Operations and Architecture Fig. 15 shows the architecture of the proposed FBMMC- TCAM. Four hybrid-pipelined TCAMs are grouped as a multi-mode processing element. The multiplexer between the

CHEN et al.: DYNAMIC RECONFIGURABLE TCAM FOR OPENFLOW-COMPLIANT LOW-POWER PACKET PROCESSING 1669 Fig. 15. Architecture of the proposed FBMMC-TCAM.

9 CHEN et al.: DYNAMIC RECONFIGURABLE TCAM FOR OPENFLOW-COMPLIANT LOW-POWER PACKET PROCESSING 1669 Fig. 15. Architecture of the proposed FBMMC-TCAM. TABLE VIII WORD LENGTHS OF DIFFERENT MODE SETTINGS NAND-type TCAM and the pipeline can transmit match information depending on the mode setting. Table VIII indicates the different mode settings of the proposed reconfigurable TCAM. Mode 1 has 144 bits per rule, and the match information of each filters are independent of each other. In Fig. 15, mode 1 represents four independent rules, and the output shifter shifts four output results to ML_out. When in mode 2 (dual-search mode), as shown in Fig. 16, the first and second hybrid-pipelined TCAMs are grouped together, and the third and fourth hybrid-pipelined TCAMs are also cascaded to accomplish a dual-search mode. As a result, mode2 has = 288 bits per rule, including 9 2=18bits for the rule filter. To prevent the pre-charge circuit of the NOR-type TCAM from false activation (e.g., when the search key matches a 9-bit filter but misses the other 9-bit filter, this match information will cause false activation of the main search component), a1-bitnor gate is inserted to check whether the search key matches both the 9-bit filters. False activation would results in unnecessary power consumption but also a false match result. The output shifter shifts two match results to ML_out. In mode 3 (quad-search mode), four hybrid-pipelined TCAMs are cascaded. Because 540 bits of main-search components have greater power consumption than in the case of mode 1 mode 2, four rule filters are grouped to provide stricter rule filtering. The block diagram of the designed FBMMC-TCAM is shown in Fig. 17. The operation is divided into three phases: in the first phase, a control unit and an IO circuit are used to handle mode setting, rule searching, and rule updating. Because of the dualport design, the second phase can simultaneously perform rule searching and updating. The last phase outputs a match result by 4 64 shift registers. These phases can work simultaneously. C. Layout and Simulation Result Fig. 18 shows the layout of the proposed FBMMC-TCAM array, where four NAND- and NOR-type TCAMs are well grouped by multi-mode cascaded units. The proposed FBMMC- TCAM is implemented with TSMC 40 nm 1P9M CMOS technology at 25 C in a typical design-corner. The post-layout result of the FBMMC-TCAM shows that the energy metric is only fj/bit/search, with 50% mask data, and toggled search patterns. Additional logic gates, global routing, buffers increase the energy metric compared with the single-search mode of the hybrid-pipelined TCAM. The maximum search frequency is 416 MHz. Because longer length of TCAM consumes more power consumption, the proposed FBMMC-TCAM can adjust the length of the filter according to the requirement of the mainsearch component. Moreover, the match information of each filter and the main-search component is only 1 bit and the mode information is shared by the flow table specification when required, resulting in low area overhead. In this design, the proposed FBMMC-TCAM can achieve a multiple-table option without wasting much of the unused area. The area and timing overheads of the proposed FBMMC architecture, shown in Table IX, are relatively low. In addition, every hybrid-pipelined TCAM is properly utilized without wasting any storage and search components. This design concept enables easy extension to any mode higher than the triple-search mode, with the use of additional multiplexers and basic logic gates. However, a triple-search mode is sufficient for current OpenFlow specifications. Such an area-efficient reconfigurable design is well suited for multiple-table processing engines. V. C OMPARISON WITH OTHER RELATED WORKS When applying traditional TCAM designs with word length insufficient for multiple-field packet classification, we need to apply an additional packet algorithm to handle multiple match fields. Most packet classification algorithms used in TCAM-based solutions are decision-tree-based [25], [26] and decomposition-based [11], [27] algorithms. Decision-treebased approaches cut the search space recursively into smaller

1670 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 63, NO. 10, OCTOBER 2016 Fig. 16. Mode-2 of the proposed FBMMC-TCAM. TABLE IX AREA AND TIMING OVERHEAD OF FBMMC-TCAM Fig. 17.

However, the performance of decision-tree-based approaches also depends on the rule set. A cut in one match field leads to duplicated rules in other match fields, resulting in rule set expansion [26].

10 1670 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 63, NO. 10, OCTOBER 2016 Fig. 16. Mode-2 of the proposed FBMMC-TCAM. TABLE IX AREA AND TIMING OVERHEAD OF FBMMC-TCAM Fig. 17. Block diagram of the proposed FBMMC-TCAM module. Fig. 18. Layout of the FBMMC TCAM array. subspaces according to a rule set. However, the performance of decision-tree-based approaches also depends on the rule set. A cut in one match field leads to duplicated rules in other match fields, resulting in rule set expansion [26]. Decomposition-based approaches first search each header field individually, and then merge partial results into the final result. This approach requires an additional hardware module to resolve hash collisions. The above techniques split a flow table with long word lengths into several flow tables with shorter word lengths. In general, a table with a 568-bit rule length is split into at least 4 to 12 flow tables [25], which causes a several times greater search latency than in case of single flow tables. Table X shows a comparison of the proposed TCAM with other TCAM architectures. The normalized energy metric defined in [14] is used to have a fair comparison as follows: ( E = E 40 Technology ) ( 0.9 V DD ) 2 (9) where E represents energy metric (fj/bit/search) and E denotes the energy metric normalized for 40 nm CMOS technology and a supply voltage of 0.9 V. When applying related TCAM designs to an OpenFlow switch, although power consumption of Butterfly [14] and LVMLSS TCAMs [15] is low, they did not consider long word lengths and reconfigurable architecture; resulting in not only high latency coming from split table but inflexible table configuration. Since the proposed dynamic reconfigurable TCAM can support long and variable word lengths, the search latency is several times less than other TCAM designs that are insufficient for long and variable word lengths. Moreover, this is achieved without the need to split the table or an additional hardware module to deal with multiple split tables. Although 4-parallel TCAM design [24] can support long and variable word lengths with flexible priority encoder, however, priority encoder is not suitable for many-field classification; moreover, the energy metric is ten times more than the FBMMC-TCAM. The proposed FBMMC-TCAM takes advantage of hierarchical MLs and a S-PG technique to significantly reduce power consumption. Our design can achieve both long and variable word lengths, with low static and dynamic power consumption and high throughput rate. The proposed TCAM

11 CHEN et al.: DYNAMIC RECONFIGURABLE TCAM FOR OPENFLOW-COMPLIANT LOW-POWER PACKET PROCESSING 1671 TABLE X SUMMARY OF DIFFERENT TERNARY CONTENT-ADDRESSABLE MEMORY design is very suitable for the emerging high performance packet processing in future SDN applications that need long and variable word lengths. VI. CONCLUSION We developed a dynamic reconfigurable TCAM for OpenFlow-compliant low-power packet processing. The proposed FBMMC architecture can configure variable word lengths using relatively low area and timing overheads. The proposed hybrid-pipelined structure can achieve lower dynamic power consumption and high throughput rate. Moreover, the proposed self-power gating technique lowers static power consumption without additional control and global routing overheads. Compared with other reconfigurable architectures, the proposed FBMMC-TCAM enables a considerable reduction in power consumption. The proposed TCAM design is very suitable for the emerging high performance packet processing in future SDN applications that need long and variable word lengths. ACKNOWLEDGMENT The authors would like to thank National Chip Implementation Center (CIC), National Applied Research Laboratories (NARL), Taiwan for technical support in layout. REFERENCES [1] N. Mckeown, Software-Defined Networking, INFOCOM Keynote Talk, Rio de Janeiro, Brazil, Apr [2] S. Jain et al., B4: Experience with a globally-deployed software defined wan, in Proc. ACM SIGCOMM 2013, pp. 3 14, [3] F. Hu, Q. Hao, and K. Bao, A survey on software defined networking (SDN) and OpenFlow: From concept to implementation, IEEE Commun. Surv. Tut., vol. 16, no. 4, pp , 4th Quart [4] ONF: OpenFlow Switch Specification Version [Online]. Available: [5] K. Pagiamtzis and A. Sheikholeslami, Content-addressable memory (CAM) circuits and architectures: A tutorial and survey, IEEE J. Solid- State Circuits, vol. 41, no. 3, pp , Mar [6] P. Gupta and N. McKeown, Algorithms for packet classification, IEEE Network, vol. 15, no. 2, pp , Mar./Apr [7] D. J. Craft, A fast hardware data compression algorithm and some algorithmic extensions, IBM J. Res. Develop., vol. 42, no. 6, pp , Nov [8] M. Nakanishi and T. Ogura, Real-time CAM-based Hough transform and its performance evaluation, Mach. Vis. Appl., vol. 12, no. 2, pp , Aug [9] P.-F. Lin and J. Kuo, A 1-V 128-kb four-way set-associative cmos cache memory using wordline-oriented tag-compare (WLOTC) structure with the content-addressable-memory (CAM) 10-transistor tag cell, IEEE J. Solid-State Circuits, vol. 36, no. 4, pp , Apr [10] C.-C. Wang, C.-J. Cheng, T.-F. Chen, and J.-S. Wang, An adaptively dividable dual-port BiTCAM for virus-detection processors in mobile devices, IEEE J. Solid-State Circuits, vol. 44, no. 5, pp , May [11] Y. Qu, S. Zhou, and V. Prasanna, Scalable many-field packet classification on multi-core processors, in Proc. Int. Symp. Comput. Archit. High Perform. Comput.(SBAC-PAD), Oct. 2013, pp [12] Z. Cai, Z. Wang, K. Zheng, and J. CAO, A distributed TCAMcoprocessor architecture for integrated longest prefix matching, policy filtering, and content filtering, IEEE Trans. Comput., vol. 62, no. 3, pp , Mar [13] H.-J. Tsai, K.-H. Yang, Y.-C. Peng, C.-C. Lin, Y.-H. Tsao, M.-F. Chang, and T.-F. Chen, Energy-efficient non-volatile TCAM search engine design using priority-decision in memory technology for DPI, in Proc. DAC, 2015, pp [14] P.-T. Huang and W. Hwang, A 65 nm fj/bit/search 256x144TCAM macro design for IPv6 lookup tables, IEEE J. Solid-State Circuits, vol. 46, no. 2, pp , Feb [15] I. Hayashi, T. Amano, N. Watanabe, Y. Yano, Y. Kuroda, M. Shirata, K. Dosaka, K. Nii, H. Noda, and H. Kawai, A 250-MHz 18-Mb full ternary CAM with low-voltage matchline sensing scheme in 65-nm CMOS, IEEE J. Solid-State Circuits, vol. 48, no. 11, pp , Nov [16] Y.-J. Chang, K.-L. Tsai, and H.-J. Tsai, Low leakage TCAM for IP lookup using two-side self-gating, IEEE Trans. Circuits Sys. I, Reg. Papers, vol. 60, no. 6, pp , Jun

1672 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 63, NO. 10, OCTOBER 2016 [17] Y.-J. Chang, Using the dynamic power source technique to reduce TCAM leakage power, IEEE Trans.

Min, J.-M. Oh, and H.-J. Kang, A low power content addressable memory using low swing search lines, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 12, pp. 2849 2858, Dec. 2011. [19] K.

[20] B.-D. Yang and L.-S. Kim, A low-power CAM using pulsed NAND- NOR match-line and charge-recycling search-line driver, IEEE J. Solid- State Circuits, vol. 55, no. 6, pp. 1736 1744, Aug. 2005.

58-fJ/Bit/ search 1-GHz ternary content addressable memory compiler using siliconaware early-predict late-correct sensing with embedded deep-trench capacitor noise mitigation, IEEE J.

12 1672 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 63, NO. 10, OCTOBER 2016 [17] Y.-J. Chang, Using the dynamic power source technique to reduce TCAM leakage power, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 11, pp , Nov [18] B.-D. Yang, Y.-K. Lee, S.-W. Sung, J.-J. Min, J.-M. Oh, and H.-J. Kang, A low power content addressable memory using low swing search lines, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 12, pp , Dec [19] K. Pagiamtzis and A. Sheikholeslami, A low-power content-addressable memory (CAM) using pipelined hierarchical search scheme, IEEE J. Solid-State Circuits, vol. 39, no. 9, pp , Sep [20] B.-D. Yang and L.-S. Kim, A low-power CAM using pulsed NAND- NOR match-line and charge-recycling search-line driver, IEEE J. Solid- State Circuits, vol. 55, no. 6, pp , Aug [21] I. Arsovski, T. Hebig, D. Dobson, and R. Wisort, A 32 nm 0.58-fJ/Bit/ search 1-GHz ternary content addressable memory compiler using siliconaware early-predict late-correct sensing with embedded deep-trench capacitor noise mitigation, IEEE J. Solid-State Circuits, vol. 48, no. 4, pp , Apr [22] Deering and R. Hinden, RFC 1883: Internet Protocol Version 6 (IPv6) Specification, Internet Engineering Task Force (IETF), Dec [23] B.-D. Yang, Low-power effective memory-size expanded TCAM using data-relocation scheme, IEEE J. Solid-State Circuits, vol. 50, no. 10, pp , Oct [24] K. Nii, T. Amano, N. Watanabe, M. Yamawaki, K. Yoshinaga, M. Wada, and I. Hayashi, A 28 nm 400 MHz 4-parallel 1.6 Gsearch/s 80 Mb ternary CAM, in 2014 IEEE ISSCC Dig. Tech. Papers, Feb. 2014, pp [25] W. Jiang and V. K. Prasanna, Scalable packet classification on FPGA, IEEE Trans. VLSI Syst., vol. 20, no. 9, pp , Sep [26] P. Gupta and N. McKeown, Classifying packets with hierarchical intelligent cuttings, IEEE Micro, vol. 20, no. 1, pp , Jan./Feb [27] V. Pus and J. Korenek, Fast and scalable packet classification using perfect hash functions, in Proc. ACM/SIGDA Int. Symp. FPGA, 2009, pp [28] P. T. Congdon, P. Mohapatra, M. Farrens, and V. Akella, Simultaneously reducing latency and power consumption in OpenFlow switches, IEEE/ACM Trans. Networking, vol. 22, no. 3, pp , Jun [29] M. Ross, Content Addressable Memory (CAM) With Accesses to Multiple CAM Arrays Used to Generate Result for Various Matching Sizes, U.S. Patent , Feb. 25, Ting-Sheng Chen received the B.S. degree in electrical engineering from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in He is currently working toward the Ph.D. degree from the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei. His research interests are in the areas of softwaredefined networking, low-power circuit design, and VLSI implementation of DSP algorithms. Ding-Yuan Lee received the B.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in He is currently working toward the Ph.D. degree in the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei. His research interests are in the areas of softwaredefined networking, arithmetic unit design, and VLSI implementation of DSP algorithms. Tsung-Te Liu received the B.S. and M.S. degrees from the National Taiwan University, Taipei, Taiwan, R.O.C., in 2002 and 2004, respectively, and the Ph.D. degree from the University of California, Berkeley, in 2012, all in electrical engineering. From 2004 to 2005, he was with MediaTek Inc., Taiwan, involved in circuit and system design for wireless communications. From 2005 to 2012, he was a member of the Berkeley Wireless Research Center (BWRC), University of California, Berkeley. From 2012 to 2014, he was with Interuniversity Microelectronics Centre (IMEC), Belgium, where he conducted research on circuit development for advanced CMOS technology. In 2014, he joined the faculty of the National Taiwan University, where he is currently an Assistant Professor of the Graduate Institute of Electronics Engineering and the Department of Electrical Engineering. His research interests involve energy-efficient circuit and system designs. He is the recipient of several design and teaching awards. An-Yeu (Andy) Wu (M 96 SM 12 F 15) received the B.S. degree from National Taiwan University, Taipei, Taiwan, R.O.C., in 1987, and the M.S. and Ph.D. degrees from the University of Maryland, College Park, in 1992 and 1995, respectively, all in electrical engineering. In August 2000, he joined the faculty of the Department of Electrical Engineering and the Graduate Institute of Electronics Engineering, National Taiwan University (NTU), where he is currently a Professor. His research interests include low-power/ high-performance VLSI architectures for DSP and communication applications, adaptive/multirate signal processing, reconfigurable broadband access systems and architectures, and System-on-Chip (SoC)/Network-on-Chip (NoC) platform for software/hardware co-design. He has published more than 190 refereed journal and conference papers in above research areas, together with five book chapters and 16 granted U.S. patents. From August 2007 to December 2009, he was on leave from NTU and served as the Deputy General Director of SoC Technology Center (STC), Industrial Technology Research Institute (ITRI), Hsinchu, Taiwan, supervising Parallel Core Architecture (PAC) VLIW DSP Processor and Multicore/Android SoC platform projects. Dr. Wu received the Outstanding Electrical Engineering Professor Award in 2010 from The Chinese Institute of Electrical Engineering (CIEE), Taiwan. He became an IEEE Fellow in 2015 for his contributions to DSP algorithms and VLSI designs for communication IC/SoC.

Filter-Based Dual-Voltage Architecture for Low-Power Long-Word TCAM Design

Filter-Based Dual-Voltage Architecture for Low-Power Long-Word TCAM Design Ting-Sheng Chen, Ding-Yuan Lee, Tsung-Te Liu, and An-Yeu (Andy) Wu Graduate Institute of Electronics Engineering, National Taiwan