
Scalable Ternary Content Addressable Memory Implementation Using FPGAs

Weirong Jiang
Xilinx Research Labs
San Jose, CA, USA

ABSTRACT
Ternary Content Addressable Memory (TCAM) is widely used in network infrastructure for various search functions. There has been a growing interest in implementing TCAM using reconfigurable hardware such as Field Programmable Gate Array (FPGA). Most existing FPGA-based TCAM designs are based on brute-force implementations, which result in inefficient on-chip resource usage. As a result, existing designs support only a small TCAM size even with large FPGA devices. They also suffer from significant throughput degradation when implementing a large TCAM, mainly caused by deep priority encoding. This paper presents a scalable random access memory (RAM)-based TCAM architecture aimed at efficient implementation on state-of-the-art FPGAs. We give a formal study of the RAM-based TCAM to unveil the ideas and the algorithms behind it. To conquer the timing challenge, we propose a modular architecture consisting of arrays of small-size RAM-based TCAM units. After decoupling the update logic from each unit, the modular architecture allows us to share each update engine among multiple units, which leads to resource savings. The capability of explicit range matching is also offered, to avoid range-to-ternary conversion for search functions that require range matching. Implementation on a Xilinx Virtex-7 FPGA shows that our design can support a large TCAM of up to 2.4 Mbits while sustaining a high throughput of 150 million packets per second. The resource usage scales linearly with the TCAM size. The architecture is configurable, allowing various performance trade-offs to be exploited. To the best of our knowledge, this is the first FPGA design that implements a TCAM larger than 1 Mbit.

Categories and Subject Descriptors
C.1.4 [Processor Architectures]: Parallel Architectures; C.2.6 [Computer Communication Networks]: Internetworking

General Terms
Algorithms, Design, Performance

Keywords
FPGA; RAM; TCAM

1. INTRODUCTION
Ternary Content Addressable Memory (TCAM) is a specialized associative memory where each bit can be 0, 1, or "don't care" (i.e., *). TCAM has been widely used in network infrastructure for various search functions including longest prefix matching (LPM), multi-field packet classification, etc. For each input key, a TCAM performs a parallel search over all stored words and finds the matching word(s) in a single clock cycle. A priority encoder is needed to obtain the index of the matching word with the highest priority. In a TCAM, the physical location normally determines the priority; e.g., the top word has the highest priority. Most current TCAMs are implemented as standalone application-specific integrated circuits (ASICs). We call them native TCAMs. Native TCAMs are expensive, power-hungry, and not scalable with respect to clock rate or circuit area, especially compared with Random Access Memories (RAMs). The limited configurability of native TCAMs does not fit the requirements of some network applications (e.g., OpenFlow [1]) where the width and/or the depth of different lookup tables can be variable [2]. Various algorithmic solutions have been proposed as alternatives to native TCAMs, but none of them is exactly equivalent to a TCAM. The success of algorithmic solutions is limited to a few specific search functions such as exact matching and LPM.
For some other search functions such as multi-field packet classification, the algorithmic solutions [3, 4] employ various heuristics, leading to non-deterministic performance that is often dependent on the characteristics of the data set. On the other hand, reconfigurable hardware such as the field-programmable gate array (FPGA) combines the flexibility of software with near-ASIC performance. State-of-the-art FPGA devices such as Xilinx Virtex-7 [5] and Altera Stratix-V [6] provide high clock rates, low power dissipation, rich on-chip resources and large amounts of embedded memory with configurable word width. Due to their increasing capacity, modern FPGAs have become an attractive option for implementing various networking functions [7, 8, 9, 10]. Compared with ASIC, FPGA technology is increasingly favorable because of its shorter time to market, lower development cost and the shrinking performance gap between FPGA and ASIC. Due to the demand for TCAMs that are flexible to configure and easy to integrate, there has been a growing interest in employing FPGAs to implement TCAMs or TCAM-equivalent search engines. While several FPGA-based TCAM designs exist, most of them are brute-force implementations that mimic the native TCAM architecture. Their resource usage is inefficient, which makes them less interesting in practice. On the other hand, some recent work [11, 12, 13] shows that RAMs can be employed to emulate/implement a TCAM.

But none of that work gives a correctness proof or a thorough study of efficient FPGA implementation. Their architectures are monolithic and do not scale well when implementing large TCAMs. A goal of this paper is to advance FPGA-based TCAM design by investigating both the theory and the architecture of RAM-based TCAM implementation. The main contributions include:

- We give an in-depth introduction to the RAM-based TCAM. We formalize the key ideas and the algorithms behind it.
- We analyze thoroughly the theoretical performance of the RAM-based TCAM and identify the key challenges in implementing a large RAM-based TCAM.
- We propose a modular and scalable architecture that consists of arrays of small-size RAM-based TCAM units. By decoupling the update logic from each unit, such a modular architecture enables each update engine to be shared among multiple units, saving logic resources.
- We share our experience in implementing the proposed architecture on a state-of-the-art FPGA. The post place and route results show that our design can support a large TCAM of up to 2.4 Mbits while sustaining a high throughput of 150 million packets per second (Mpps). To the best of our knowledge this is the first FPGA design that implements a TCAM larger than 1 Mbit.
- We conduct comprehensive experiments to characterize the various performance trade-offs offered by the configurable architecture. We also discuss the support of range matching without range-to-ternary conversion.

The rest of the paper is organized as follows. Section 2 gives a detailed introduction to the theoretical aspects of the RAM-based TCAM. Section 3 discusses the hardware architectures for scalable RAM-based TCAM. Section 4 presents comprehensive evaluation results based on the implementation on a state-of-the-art FPGA. Section 5 reviews the related work on FPGA-based TCAM designs. Section 6 concludes the paper.

2. RAM-BASED TCAM

2.1 Terminology
We first have the following definitions:

- The depth of a TCAM (or RAM) is the number of words in the TCAM (or RAM), denoted N.
- The width of a TCAM (or RAM) is the number of bits of each TCAM (or RAM) word, denoted W.
- The size of a TCAM (or RAM) is the total number of bits of the TCAM (or RAM); it equals N × W.
- The address width of a RAM is the number of bits of the RAM address, denoted d. Note that N = 2^d for a RAM.

We describe the organization of a TCAM or RAM as Depth × Width, i.e., N × W. For example, a 2 × 1 RAM consists of 2 words where each word is 1 bit. We call a TCAM or RAM wide (or narrow) if its width is large (or small), and deep (or shallow) if its depth is large (or small). We also use the notation shown in Table 1.

Table 1: Notation
Notation | Description
k        | An input key, i.e., a binary number
t        | A ternary word
A        | An alphabet of 1-bit characters
A^n      | The set of all n-bit strings over A
|s|      | The length of a string s in A^n, i.e., |s| = n

2.2 Main Ideas
A TCAM can be divided into two logical areas: (1) the TCAM words and (2) the priority encoder. Each TCAM word consists of a row of match cells attached to the same match line. During lookup, each input key goes to all the N words in parallel and retrieves an N-bit match vector. The i-th bit of the match vector indicates whether the key matches the i-th word, i = 1, 2, ..., N. In this section, for ease of discussion, we consider a TCAM without the priority encoder. Thus the output of the considered TCAM is an N-bit match vector instead of the index of the matching word with the highest priority.
Looking up an N × W TCAM is basically mapping a W-bit binary input key to an N-bit binary match vector. The same mapping can be achieved by using a 2^W × N RAM, where the W-bit input key is used as the address to access the RAM and each RAM word stores an N-bit vector. Figure 1(a) shows a 1 × 1 TCAM and its corresponding RAM-based implementation. As the TCAM word stores a don't care bit, the match vector is always 1, no matter whether the input 1-bit key is 0 or 1.

2.2.1 Depth Extension
The depth of a native TCAM is increased by vertically stacking words of the same width. Correspondingly, in the RAM-based implementation, the depth of a TCAM is extended by increasing the width of the RAM: each column of the RAM represents the match vector for one word. Figure 1(b) shows a 2 × 1 TCAM which adds a word to the TCAM shown in Figure 1(a); correspondingly, the RAM-based implementation adds a column to the RAM shown in Figure 1(a). We see that the memory requirement of either the native TCAM or its RAM-based implementation is linear in the depth. We can also view the depth extension as concatenating the match vectors from multiple shallower TCAMs. For instance, an N × W TCAM can be horizontally divided into two TCAMs, one N_1 × W and the other N_2 × W, where N = N_1 + N_2. Then there are two RAMs in the corresponding RAM-based implementation, one 2^W × N_1 and the other 2^W × N_2. The outputs of the two RAMs are concatenated to obtain the final N-bit match vector. This is essentially equivalent to building a wider RAM by concatenating two RAMs with the same depth. For the sake of simplicity, we consider a wide RAM built by concatenating multiple RAMs in this way as a single RAM.

Figure 1: (a) Matching a 1-bit key with a 1 × 1 TCAM; (b) matching a 1-bit key with a 2 × 1 TCAM; (c) matching a 2-bit key with a 1 × 2 TCAM.

Figure 2: Building a 1 × 2 TCAM using two 1 × 1 TCAMs.

Table 2: Representing a ternary bit t in a 2 × 1 RAM
The value of t  | RAM[0] | RAM[1]
0               | 1      | 0
1               | 0      | 1
don't care (*)  | 1      | 1

2.2.2 Width Extension
A wider TCAM deals with a wider input key. When implementing the TCAM in a single RAM, a wider input key (which is used as the address to access the RAM) implies a wider address for the RAM. This results in a deeper RAM whose depth is 2^W. Figure 1(c) shows a 1 × 2 TCAM which extends the width of the TCAM shown in Figure 1(a): as the width of the input key is increased by 1 bit, the depth of the RAM in the corresponding RAM-based TCAM is doubled. Such a design cannot scale well for wide input keys. An alternative solution is to use multiple narrow TCAMs to implement a wide TCAM. For example, an N × W TCAM can be vertically divided into two TCAMs, one N × W_1 and the other N × W_2, where W = W_1 + W_2. During lookup, a W-bit input key is divided into two segments accordingly, one of W_1 bits and the other of W_2 bits. Each of the two narrower TCAMs matches the corresponding segment of the key and outputs an N-bit match vector. The two match vectors are then bitwise ANDed to obtain the final match vector. The two narrow TCAMs map to two shallow RAMs in the corresponding RAM-based implementation. The memory requirement per TCAM word becomes 2^{W_1} + 2^{W_2} bits instead of 2^W = 2^{W_1} × 2^{W_2} bits. Figure 2 shows how a 1 × 2 TCAM is built based on two 1 × 1 TCAMs.

2.2.3 Populating the RAM
Given a set of ternary words, we need to populate the RAMs so that the RAM-based implementation fulfills the same search function as the native TCAM. As shown in Figure 1(a), it is easy to populate the RAM for the RAM-based implementation of a 1 × 1 TCAM. Table 2 shows the content of the 2 × 1 RAM populated for the 1 × 1 TCAM, where RAM[k] denotes the RAM word at address k, k = 0, 1. Principle 1 states the rule for populating the 2 × 1 RAM to represent a 1 × 1 TCAM, where k ∈ {0, 1} and t ∈ {0, 1, *}.

Principle 1. RAM[k] = 1 if and only if k matches t; otherwise RAM[k] = 0.

Theorem 1. The RAM populated following Principle 1 achieves a function equivalent to the TCAM that stores t.

Proof. In the TCAM, the output for an input k is 1 if k matches t; otherwise the output is 0. In the populated RAM, the output for an input k is 1 if RAM[k] = 1, which by Principle 1 holds exactly when k matches t; otherwise the output is 0. Thus the populated RAM is equivalent to the represented TCAM.

Both Principle 1 and Theorem 1 are directly applicable to the case of a 1 × W TCAM where k ∈ A^W, A = {0, 1}, and t is a W-bit ternary word over {0, 1, *}. Principle 1 can also be extended to the case of an N × W TCAM implemented in a 2^W × N RAM. Let RAM[k][i] denote the i-th bit of the k-th word in the RAM, and let t_i denote the i-th word in the TCAM, i = 1, 2, ..., N. We then have Principle 2 for populating the 2^W × N RAM to represent an N × W TCAM.

Principle 2. RAM[k][i] = 1 if and only if k matches t_i; otherwise RAM[k][i] = 0, for i = 1, 2, ..., N.

When a wide TCAM is built using multiple narrower TCAMs, the RAM corresponding to each narrow TCAM is populated individually by following Principle 2.
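Principles 1 and 2 translate directly into code. The following Python sketch is a minimal software model of the population step (the helper names and the example words are ours, not the paper's):

```python
def matches(key_bits, ternary_word):
    # A key matches a ternary word iff every ternary bit is '*' or equals the key bit.
    return all(t in ('*', k) for k, t in zip(key_bits, ternary_word))

def populate_ram(ternary_words, W):
    # Principle 2: RAM[k][i] = 1 iff the W-bit key k matches the i-th ternary word t_i.
    N = len(ternary_words)
    ram = [[0] * N for _ in range(2 ** W)]
    for k in range(2 ** W):
        key_bits = format(k, '0%db' % W)
        for i, t in enumerate(ternary_words):
            ram[k][i] = 1 if matches(key_bits, t) else 0
    return ram

# A 4 x 3 TCAM emulated by a 2^3 x 4 RAM (the stored words are our own example).
ram = populate_ram(['10*', '1*1', '***', '010'], W=3)
print(ram[0b101])  # match vector for key 101 -> [1, 1, 1, 0]
```

A lookup is then a single indexing operation into the populated table, which is exactly the O(1) behavior formalized next.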

2.3 Algorithms and Analysis
This section formalizes the algorithms for using a RAM-based TCAM built according to the discussion in Section 2.2.

2.3.1 General Model
Based on the discussion in Section 2.2.2, an N × W TCAM can be constructed using P narrow TCAMs, P = 1, 2, ..., W. The size of the i-th TCAM is N × W_i, i = 1, 2, ..., P, and W = sum_{i=1}^{P} W_i. Let RAM_i denote the RAM corresponding to the i-th narrow TCAM; the size of RAM_i is 2^{W_i} × N. Hence the N × W TCAM can be implemented using these P RAMs.

2.3.2 Lookup
Algorithm 1 shows how to search a key over the RAM-based TCAM. It takes O(1) time to access each RAM. Since the P RAMs are accessed in parallel, the overall time complexity for lookup is O(1).

Algorithm 1 Lookup
Input: A W-bit key k.
Input: {RAM_i}, i = 1, 2, ..., P.
Output: An N-bit match vector m.
1: Divide k into P segments: k -> {k_1, k_2, ..., k_P}, |k_i| = W_i, i = 1, 2, ..., P.
2: Initialize m to be all 1s: m <- 11...1.
3: for i <- 1 to P do  {bitwise AND}
4:   m <- m & RAM_i[k_i]
5: end for

2.3.3 Update
Updating a TCAM means either adding or deleting a specific TCAM word. Algorithm 2 shows how to add or delete the n-th word of the TCAM in the RAM-based implementation, n = 1, 2, ..., N. It takes O(2^{W_i}) time to update RAM_i, i = 1, 2, ..., P. As the P RAMs are updated in parallel, the overall time complexity for an update is determined by the RAM that takes the longest time, which is O(max_{i=1..P} 2^{W_i}) = O(2^{max_i W_i}).

Algorithm 2 Updating a TCAM word
Input: A W-bit ternary word t.
Input: The index of t: n.
Input: The update operation: op ∈ {add, delete}.
Output: Updated {RAM_i}, i = 1, 2, ..., P.
1: Divide t into P segments: t -> {t_1, t_2, ..., t_P}, |t_i| = W_i, i = 1, 2, ..., P.
2: for i <- 1 to P do  {update each RAM}
3:   for k <- 0 to 2^{W_i} - 1 do
4:     if k matches t_i and op == add then
5:       RAM_i[k][n] <- 1
6:     else
7:       RAM_i[k][n] <- 0
8:     end if
9:   end for
10: end for

2.3.4 Space Analysis
The size of RAM_i is 2^{W_i} × N, i = 1, 2, ..., P. Hence the overall memory requirement is sum_{i=1}^{P} (2^{W_i} × N) = N sum_{i=1}^{P} 2^{W_i}. To minimize the overall memory requirement, we formulate the problem as:

  min_{P} min_{W_1, W_2, ..., W_P} sum_{i=1}^{P} 2^{W_i}    (1)
  subject to sum_{i=1}^{P} W_i = W    (2)

For a given P, min_{W_1, ..., W_P} sum_{i=1}^{P} 2^{W_i} = P × 2^{W/P}, achieved when W_i = W/P for all i = 1, 2, ..., P. Hence the overall memory requirement is minimized when all the P RAMs have the same address width, denoted w = W/P. The depth of each RAM is then 2^w, and the overall memory requirement is

  sum_{i=1}^{W/w} (2^w × N) = (W/w) × 2^w × N = N × W × 2^w / w    (3)

We define the RAM/TCAM ratio as the number of RAM bits needed to implement one TCAM bit. According to Equation (3), the RAM/TCAM ratio is 2^w / w when all the RAMs employ the same address width w. Basically, a larger w results in a larger RAM/TCAM ratio, which indicates lower memory efficiency. The minimum RAM/TCAM ratio is 2, achieved when w = 1 (P = W) or w = 2 (P = W/2). In other words, when the depth of each RAM is 2 (w = 1) or 4 (w = 2), the overall memory requirement achieves its minimum of 2NW, i.e., twice the size of the corresponding native TCAM.

2.3.5 Comparison with Native TCAM
Table 3 summarizes the differences between the native TCAM and its corresponding implementation using P RAMs, with respect to time and space complexities. Here we consider all the RAMs to employ the same address width (w), so that both the update time and the space achieve the optimum for the RAM-based TCAM (as discussed in Sections 2.3.3 and 2.3.4).

Table 3: Native TCAM vs. RAM-based TCAM
            | Native TCAM | RAM-based TCAM
Lookup time | O(1)        | O(1)
Update time | O(1)        | O(2^w)
Space       | N × W       | (2^w / w) × N × W
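As a concrete companion to Algorithms 1 and 2, the sketch below models the P-RAM construction in Python, with each match vector stored as an N-bit integer. It is a software model under our own naming, not the hardware design itself:

```python
W_SEGS = [2, 2]   # widths W_i of the P narrow TCAMs; W = sum(W_SEGS) = 4
N = 4             # TCAM depth; a match vector is modeled as an N-bit int

rams = [[0] * (2 ** w_i) for w_i in W_SEGS]   # RAM_i: 2^{W_i} words of N bits

def split(bits):
    # Divide a W-bit string into P segments of widths W_SEGS[0..P-1].
    out, pos = [], 0
    for w_i in W_SEGS:
        out.append(bits[pos:pos + w_i])
        pos += w_i
    return out

def seg_matches(k_bits, t_bits):
    return all(t in ('*', k) for k, t in zip(k_bits, t_bits))

def update(t, n, op):
    # Algorithm 2: add or delete the n-th ternary word t; O(2^{max W_i}) time.
    for ram, t_i, w_i in zip(rams, split(t), W_SEGS):
        for k in range(2 ** w_i):
            if seg_matches(format(k, '0%db' % w_i), t_i) and op == 'add':
                ram[k] |= 1 << n     # set bit n of RAM_i[k]
            else:
                ram[k] &= ~(1 << n)  # clear bit n of RAM_i[k]

def lookup(key_bits):
    # Algorithm 1: bitwise-AND the P match vectors; O(1) parallel RAM accesses.
    m = (1 << N) - 1
    for ram, k_i in zip(rams, split(key_bits)):
        m &= ram[int(k_i, 2)]
    return m

update('10**', n=0, op='add')
update('0101', n=1, op='add')
print(format(lookup('1011'), '04b'))  # -> 0001: only word 0 ('10**') matches
```

With W_SEGS = [2, 2] each RAM has w = 2, so the model uses 2 × 2^2 × N = 32 RAM bits for a 16-bit TCAM, matching the minimum RAM/TCAM ratio of 2 derived above.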
3. HARDWARE ARCHITECTURE
We are interested in implementing the RAM-based TCAM on FPGA. While the theoretical discussion in Section 2 excludes the priority encoder, the hardware architecture of the RAM-based TCAM must include it.

3.1 Basic Architecture
The theoretical model of the RAM-based TCAM implementation (discussed in Section 2.3.1) can be directly mapped to the hardware architecture shown in Figure 3. An N × W TCAM is implemented using P RAMs, where the size of the i-th RAM is 2^{W_i} × N and sum_{i=1}^{P} W_i = W. As illustrated in Algorithm 1, a lookup is performed by dividing the input W-bit key into P segments, the i-th of which is W_i bits long. Each segment of the key is used as the address to access the corresponding RAM, and each RAM outputs an N-bit vector. The P N-bit vectors are then bitwise ANDed to generate the final match vector, which is fed into the priority encoder to obtain the index of the matching word with the highest priority.

Figure 3: Basic architecture (without update logic).

A 1-bit Match signal is also generated to indicate whether there is any match. We add update logic to the RAM-based TCAM so that it can complete any update by itself at run time. In accordance with Algorithm 2, Figure 4 shows the logic for updating the RAM-based TCAM, where W_max = max_{i=1..P} W_i. We use two W-bit binary numbers, denoted Data and Mask, to represent the W-bit ternary word t to be updated. The i-th bit of t is a don't care bit if and only if the i-th bit of Mask is set to 1, i = 1, 2, ..., W. For example, the 2-bit ternary word 0* can be represented by Data = 00 or 01, and Mask = 01. Id specifies the index of the ternary word to be updated. Op indicates whether the ternary word is to be added (Op = 1) or deleted (Op = 0).

Figure 4: Update logic. CMP: compare. CHG: change.

Adding or deleting the Id-th ternary word t is accomplished by rewriting the Id-th bit of every RAM word: for an add, the bit is set at the addresses that match t and cleared at all others; for a delete, it is cleared everywhere. Meanwhile, we must keep the rest of the bits of these RAM words unchanged. Hence we need to read the original content of each RAM word, change only the Id-th bit, and then write the updated RAM word back to the RAM. This requires 2 × 2^w clock cycles to update a single-port RAM whose address width is w. To reduce the update latency, we utilize a simple dual-port RAM and perform the read and the write in the same clock cycle. A simple dual-port RAM has two address ports, one used only for reads and the other only for writes. At each clock cycle during an update, the update logic writes the updated RAM word to address k while reading the content of the RAM word at address k + 1. Hence the update latency becomes 2^w + 1 clock cycles, where the extra clock cycle is consumed to fetch the content of the first RAM word. Another part of the update logic is a state machine (not shown in Figure 4) that switches the state of the TCAM between lookup and update. During an update, no lookup is permitted and any match result is invalid.

3.2 Modular Architecture
In implementing a large-scale RAM-based TCAM on FPGA, there are two main challenges:

Throughput: When the TCAM is deeper or wider, the logic and routing complexities become larger, especially for bitwise-ANDing many wide bit vectors and for priority encoding a deep match vector. This results in significant degradation of the achievable clock rate, which determines the maximum throughput of the RAM-based TCAM.

Resource usage: The on-chip resources of an FPGA device are limited. Hence we must optimize the architecture to save resources or use them efficiently. We need to find the best memory configuration based on the physical capabilities, and it is also desirable to enable resource sharing between subsystems.

We propose a scalable and modular architecture that employs configurable small-size RAM-based TCAM units as building blocks. Both bit-vector ANDing and priority encoding are performed in a localized and pipelined fashion so that high throughput is sustained for large TCAMs.
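The dual-port update schedule described above (write address k while prefetching address k + 1) can be sketched in software as follows. The function and its toy RAM are our own illustrative model of the read-ahead/write-back pipeline, not the RTL:

```python
def stream_update(ram, bit_idx, matching_addrs):
    # One pass over a 2^w-deep simple dual-port RAM: in each cycle the engine
    # writes back the word fetched in the previous cycle (with only bit bit_idx
    # changed) while the read port prefetches the next address. The first cycle
    # only fetches word 0, so the whole pass takes 2^w + 1 cycles.
    depth = len(ram)
    fetched = ram[0]                  # cycle 1: read address 0, no write yet
    cycles = 1
    for k in range(depth):
        nxt = ram[(k + 1) % depth]    # read port: prefetch address k + 1
        if k in matching_addrs:       # write port: update address k ...
            ram[k] = fetched | (1 << bit_idx)
        else:
            ram[k] = fetched & ~(1 << bit_idx)
        fetched = nxt                 # ... keeping all other bits unchanged
        cycles += 1
    return cycles                     # 2^w + 1

ram = [0b11, 0b01, 0b10, 0b00]        # a toy 4-word RAM (w = 2)
print(stream_update(ram, bit_idx=1, matching_addrs={0, 2}), ram)  # 5 cycles
```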
We decouple the update logic from each unit so that a single update engine can be shared flexibly by multiple TCAM units, saving on-chip logic resources. Note that such resource sharing is only possible in a modular architecture.

3.2.1 Overview
The top-level design consists of a grid of units, organized in multiple rows. Figure 5 shows the top-level architecture with R rows, each of which contains L units. The TCAM words with higher priority are stored in the units with lower index. The units within a row are searched sequentially in a pipelined fashion. Priority is resolved locally within each unit. After each row outputs a matching result, a global priority encoder selects the one with the globally highest priority. A minimal software model of this organization is sketched after this paragraph.

3.2.2 Unit Design
A TCAM unit is basically a U × W TCAM implemented in RAMs, where U is the number of TCAM words per unit. Figure 6 depicts the architecture of a TCAM unit. Each unit performs the local TCAM lookup and combines the local match result with the result from the preceding unit.
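The sketch below models the grid organization just described: the lowest global word index wins, each unit resolves priority locally, and a row forwards the first (highest-priority) result it produces. The function names and the flattening of RAM lookups into direct comparisons are our own simplifications:

```python
def unit_lookup(unit_words, key):
    # Local lookup and priority encoding inside one unit: the lowest local
    # index among the matching words wins.
    for i, t in enumerate(unit_words):
        if all(b in ('*', k) for k, b in zip(key, t)):
            return i
    return None

def row_lookup(row_units, key, row_base, U):
    # Models the pipeline along a row: a unit forwards the upstream result
    # untouched if one exists, since an earlier unit always has higher priority.
    for u, unit in enumerate(row_units):
        local = unit_lookup(unit, key)
        if local is not None:
            return row_base + u * U + local
    return None

def grid_lookup(rows, key, L, U):
    # Global priority encoder: among the R row results, the smallest ID wins.
    results = [row_lookup(row, key, r * L * U, U) for r, row in enumerate(rows)]
    results = [x for x in results if x is not None]
    return min(results) if results else None

# Two rows of two 2-word units each (R = L = U = 2); the words are our example.
rows = [[['00', '01'], ['0*', '10']],
        [['1*', '11'], ['**', '01']]]
print(grid_lookup(rows, '01', L=2, U=2))  # -> 1 (word '01' in unit 0 of row 0)
```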

Figure 5: Top-level architecture.

As the unit index determines the priority, a matching TCAM word stored in a preceding unit always has a higher priority than the local matching one. The U × W TCAM is constructed using P RAMs based on the basic architecture shown in Section 3.1. We use the same address width w for all the P RAMs to achieve the maximum memory efficiency, as discussed in Section 2.3.4.

Figure 6: A unit.

When W is large, there are many RAMs, each of which outputs a U-bit vector, and the throughput may degrade when bitwise-ANDing a large number of bit vectors. We therefore divide a unit into multiple pipeline stages. Let H denote the number of stages in a unit. Then each stage contains P/H RAMs. Within each stage, the bit vectors generated by the P/H RAMs are bitwise ANDed; the resulting U-bit vector is combined with the bit vector passed from the previous stage and then passed to the next stage. The last stage of the unit performs the local priority encoding.

3.2.3 Update Engine
We make the following observations: updating a TCAM word involves updating only one unit, and the update logic is identical for units with the same memory organization. To save logic resources, it is therefore desirable to share the update logic between units. We decouple the update logic from the units and build multiple update engines. Each update engine contains the update logic and serves multiple units. An update engine maintains a single state machine and decides which unit to update based on the index (Id) of the TCAM word being updated. A unit receives from its update engine the addresses and the write-enable signals for its RAMs. The unit also interacts with its update engine to exchange the bit vectors needed to update each RAM word. Due to the decoupling of the update logic from the units, the association between the lookup units (LUs) and the update engines (UEs) is flexible. The only constraint is that the units served by the same update engine must have the same memory organization (i.e., the same P and w). Figure 7 shows three different example layouts of the update engines in a 4-row, 4-unit-per-row architecture.

3.3 Explicit Range Match
In some network search applications such as access control lists (ACL), a packet is matched against a set of rules. An ACL-like rule specifies a match condition on each of multiple packet header fields. Some fields, such as TCP ports, are specified using ranges rather than ternary strings. Taking 5-field ACL as an example, the two 16-bit port fields are normally specified as ranges. The ranges must be converted into ternary strings so that such rules can be stored in a TCAM. However, a range may be converted into multiple ternary strings: an r-bit range can be expanded to 2(r - 1) prefixes or 2(r - 2) ternary strings. If there are D such fields in a rule, the rule can be expanded to (2r - 4)^D ternary words in the worst case. This problem is called rule expansion [14]. Various range encoding methods have been proposed to minimize rule expansion.
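To make the expansion cost concrete, the following sketch performs the standard range-to-prefix splitting (our own illustration, not the paper's method) and reproduces the 2(r - 1) worst case:

```python
def range_to_prefixes(lo, hi, r):
    # Standard range-to-prefix splitting: repeatedly carve off the largest
    # power-of-two block that is aligned at lo and still fits inside [lo, hi].
    out = []
    while lo <= hi:
        size = (lo & -lo) if lo else (1 << r)  # alignment constraint at lo
        while size > hi - lo + 1:              # must also fit inside [lo, hi]
            size >>= 1
        bits = size.bit_length() - 1           # number of trailing '*'s
        stem = format(lo >> bits, '0%db' % (r - bits)) if bits < r else ''
        out.append(stem + '*' * bits)
        lo += size
    return out

# Worst case for r = 4: the range [1, 14] needs 2(r - 1) = 6 prefixes.
print(range_to_prefixes(1, 14, 4))
# ['0001', '001*', '01**', '10**', '110*', '1110']
```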

Figure 7: Example layouts of update engines (UEs): (a) Row; (b) Column; (c) Square. LU: lookup unit.

Even with the optimal range encoding [14], it still needs r ternary words to represent an r-bit range. In such a case, a rule with D range fields will occupy O(r^D) TCAM words. An attractive advantage of FPGA compared with ASIC is that we can reprogram the hardware on the fly to add customized logic. So for the ACL-like search problems, we adopt an idea similar to [15] and augment the TCAM design with explicit range match support instead of converting ranges into ternary strings. This is achieved by storing the lower and upper bounds of each range explicitly in registers. Hence, if there are N rules each containing D r-bit port fields, we require N × D × r × 2 bits of registers in total to store the lower and the upper bounds of all the ranges. On the other hand, the size of the TCAM that needs to be stored in RAMs is reduced to N × (W - D × r).

3.4 Mapping to Physical Hardware
According to the theoretical analysis in Section 2.3.4, the RAM-based implementation of an N × W TCAM requires the minimum memory when employing shallow RAMs with the same depth of 2 or 4. However, real hardware places limits on the minimum depth of physical RAMs. For example, each block RAM (BRAM) available on a Xilinx Virtex-7 FPGA can be configured as 512 × 72, 1K × 36, 2K × 18, 4K × 9, 8K × 4, 16K × 2, or 32K × 1 in simple dual-port mode. In other words, the minimum depth of a BRAM is 512 = 2^9. Let d_min denote the minimum address width of the physical RAM. An N × W logical RAM where N <= 2^{d_min} will be mapped to a 2^{d_min} × W physical RAM. Thus the RAM/TCAM ratio becomes 2^{max(w, d_min)} / w instead of 2^w / w. A trick that can be played is to map multiple shallow logical RAMs to one deep physical RAM: for example, two 2^d × W logical RAMs can be mapped to a single 2^{d+1} × W physical RAM. But the throughput will be halved unless the physical RAM has two sets of input/output ports used independently by the two logical RAMs. While some multi-port RAM designs [16] are available, they bring extra complications and are beyond the scope of this paper. Hence, when implementing the RAM-based TCAM in real hardware, the address width w of each RAM should be chosen carefully based on the available physical configurations.

4. PERFORMANCE EVALUATION
We implement our modular RAM-based TCAM architecture on a Xilinx Virtex-7 XC7V2000T device with the -2 speed grade. We evaluate the performance based on the post place and route results from the Xilinx Vivado 2013.1 development toolset. To recap, we list the key parameters of the architecture in Table 4. Note that N = R × L × U.

Table 4: Architectural parameters
Parameter | Description
N | TCAM depth
W | TCAM width
R | The number of rows
L | The number of units per row
U | The number of TCAM words per unit
H | The number of stages per unit
w | The address width of the RAMs

4.1 Analysis and Estimation
Due to its pipelined architecture, our RAM-based TCAM implementation processes one packet every clock cycle. Thus the throughput is F million packets per second (Mpps) when the clock rate of the implementation achieves F MHz. During lookup, each packet traverses the R rows in parallel. It takes L × H clock cycles to go through a row, and one more clock cycle is needed for final priority encoding when the architecture consists of more than one row. Thus the lookup latency in terms of the number of clock cycles is

  Latency = L × H, if R = 1;  L × H + 1, if R > 1.

The address width of the RAMs, w, is a critical parameter in our RAM-based TCAM: the update latency is 2^w + 1 clock cycles, while the memory requirement for implementing an N × W TCAM is (2^w / w) × N × W bits.
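These formulas can be wrapped into a small calculator. The sketch below is our own convenience model, and the R/L split in the example is one illustrative way to reach the paper's largest configuration (the clock rate itself can only come from place-and-route, not from a formula):

```python
def tcam_cost(N, W, R, L, U, H, w, f_mhz):
    # Section 4.1's formulas for a configuration of the modular architecture.
    assert N == R * L * U
    throughput_mpps = f_mhz                       # one lookup per clock cycle
    latency = L * H if R == 1 else L * H + 1      # +1 for global priority encoding
    update_cycles = 2 ** w + 1
    memory_bits = (2 ** w) * N * W // w           # RAM/TCAM ratio is 2^w / w
    return throughput_mpps, latency, update_cycles, memory_bits

# 16K x 150 bits at 150 MHz with w = 5 (R/L split is our illustrative choice):
print(tcam_cost(N=16384, W=150, R=64, L=4, U=64, H=1, w=5, f_mhz=150))
# -> (150, 5, 33, 15728640)
```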
To determine the optimal w, we examine the physical memory resources available on the FPGA device. There are two types of memory resources in Xilinx Virtex FPGAs: distributed RAM and block RAM (BRAM). While BRAMs are provided as standalone RAMs, distributed RAM is coupled with logic resources. The basic logic resource unit of an FPGA is usually called a Slice; only a certain type of Slice, named SliceM, can be used to build distributed RAM. As required by our architecture, we consider the RAMs only in simple dual-port (SDP) mode. Table 5 summarizes the total amount and the minimum address width (d_min) of the memory resources available on our target FPGA device.

Table 5: Memory resources on the XC7V2000T
RAM type (in SDP mode) | Total size (bits) | d_min
Distributed RAM        | 16,550,400        | 5
BRAM                   | 47,628,288        | 9
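The capacity estimates carried out in the next paragraphs follow directly from Table 5. As a quick sketch (the totals are the SDP-mode figures as reconstructed in Table 5):

```python
def max_tcam_bits(total_ram_bits, d_min):
    # With w = d_min, each TCAM bit costs 2^d_min / d_min RAM bits, so the
    # largest supportable TCAM is the available RAM divided by that ratio.
    return total_ram_bits * d_min // (2 ** d_min)

print(max_tcam_bits(16_550_400, 5))   # distributed RAM -> 2,586,000 bits
print(max_tcam_bits(47_628_288, 9))   # BRAM            -> 837,216 bits
```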

Figure 8: Increasing the TCAM depth (N): throughput, memory, power, and Slice usage (with utilization) for L = 4, 8, 16, 32.

Either distributed RAM or BRAM can be employed to implement the RAM-based TCAM architecture. In either case, we set w = d_min of the employed RAM type to achieve the highest memory efficiency. Based on the information in Table 5 we can estimate the maximum size of the TCAM that can be implemented on the target device. When the architecture is implemented using distributed RAM, the RAM/TCAM ratio is 2^5/5 = 32/5 and the maximum TCAM size is 16,550,400 × 5/32 = 2,586,000 bits. When using BRAM, the RAM/TCAM ratio is 2^9/9 = 512/9 and the maximum TCAM size is 47,628,288 × 9/512 = 837,216 bits. We can see that, though the total amount of BRAM bits is nearly triple that of distributed RAM bits, a BRAM-based implementation supports a much smaller TCAM due to the higher RAM/TCAM ratio. Moreover, the update latency of a distributed RAM-based implementation is 33 clock cycles, while the update latency of a BRAM-based implementation is 513 clock cycles. Hence in most of our experiments, distributed RAMs (w = 5) are employed. Also note that our architecture is modular: each unit may independently select the RAM type for its TCAM implementation. Thus the maximum TCAM size would be about 3.4 Mbits when both distributed RAMs and BRAMs are utilized.

4.2 Scalability
We are interested in how the performance scales when the TCAM depth (N) or the TCAM width (W) is increased. The key performance metrics include the throughput, the memory requirement, the power consumption estimates, and the resource usage. In these experiments, the default parameter settings are L = 4, U = 64, H = 1, and w = 5, and each unit contains its own update logic. First, we fix W = 150 and increase N by doubling R. Figure 8 shows the results, where the memory and the Slice results are drawn on a logarithmic scale; the throughput is the achievable post place-and-route clock rate in Mpps. As expected, the throughput degrades for a deeper TCAM. This is because a larger R results in a deeper final priority encoder, which becomes the critical path. Also, with a larger TCAM the resource utilization approaches 100%, which makes it difficult to route signals and further lowers the achievable clock rate. Fortunately, because of the configurable architecture, we can trade latency for throughput: since N = R × L × U, we can increase L to reduce R for a given N while keeping the other parameters fixed. As shown in Figure 8, a larger L results in a higher throughput, though at the expense of a larger latency. By tuning the latency-throughput trade-off, our design can sustain a 150 MHz clock rate for large TCAMs of up to 16K × 150 bits = 2.4 Mbits. Such a clock rate allows the design to process 150 million packets per second (Mpps), which translates to 100 Gbps throughput for minimum-size Ethernet packets.
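A quick sketch of this trade-off and of the throughput conversion (the R/L factorizations are our own illustrative choices):

```python
def latency_cycles(R, L, H=1):
    return L * H if R == 1 else L * H + 1

# For N = 16384 and U = 64 there are 256 units; a larger L means fewer rows R
# (a shallower global priority encoder) at the cost of a longer row pipeline:
for L in (4, 8, 16, 32):
    R = 256 // L
    print('L=%2d R=%2d latency=%2d cycles' % (L, R, latency_cycles(R, L)))

# 150 Mpps at minimum-size Ethernet frames: 64 B frame + 20 B preamble and
# inter-frame gap = 84 B on the wire (our assumption about the conversion):
print(150e6 * 84 * 8 / 1e9, 'Gbps')   # -> 100.8 Gbps
```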
Second, we fix the TCAM depth N = 4096 and increase the TCAM width W. Figure 9 shows that a larger TCAM width results in a lower throughput. This is because there are W/w RAMs per unit, where w = 5 in this implementation; with a large W, it becomes time-critical to bitwise-AND a large number of bit vectors within each unit. Again this can be remedied by trading latency for throughput: we increase the number of stages per unit so that each stage handles a smaller number of RAMs. As shown in Figure 9, the throughput is improved by increasing H by 1, which on the other hand increases the latency by L = 4 clock cycles. In both of the above experiments, the resource usage is linear in the TCAM size (N × W). The estimated power consumption is sublinear in the TCAM depth and linear in the TCAM width.

Figure 9: Increasing the TCAM width (W): throughput, memory, power, and Slice usage (with utilization) for H = 1 and H = 2.

4.3 Impact of Unit Size
Each TCAM unit in our architecture stores U TCAM words. It is desirable to have a small U so that the local bit-vector ANDing and priority encoding within each unit do not become the critical path. On the other hand, a smaller U leads to a larger L when R is fixed for a given N. Thus we can also tune the latency-throughput trade-off by changing U. In this experiment, we fix R = 4 and H = 1 and vary U in implementing a 1024 × 150 TCAM. As expected, Figure 10 shows that a larger U results in a lower throughput as well as a lower latency. Such a trade-off can be exploited for latency-sensitive applications where the latency is measured in nanoseconds instead of the number of clock cycles. Based on the results shown in Figure 10, when U is doubled from 64 to 128, the throughput is slightly degraded while the latency is reduced from roughly 30 ns to roughly 20 ns. The change of U has little impact on the other performance metrics, which are thus not shown here.

Figure 10: Increasing the unit size (U): throughput (Mpps) and latency (number of clock cycles).

4.4 Distributed vs. Block RAMs
As discussed in Section 4.1, distributed RAMs are more efficient than BRAMs for implementing the RAM-based TCAM on the target FPGA. But it is usually desirable to integrate the RAM-based TCAM with other engines (such as a packet parser) on a single FPGA device to form a complete packet processing system. The choice of RAM type may then depend not only on efficiency but also on the resource budget: BRAMs will be preferred for the RAM-based TCAM in case the other engines require many Slices but few BRAMs. Hence we conduct experiments to characterize the performance of the RAM-based TCAMs implemented using the two different RAM types. In these experiments, W = 150, L = 4, U = 64, and H = 1, and each TCAM unit contains its own update logic. As shown in Table 6, distributed RAM-based implementations achieve higher clock rates and lower power consumption than BRAM-based implementations. This is due to the fact that a BRAM is deeper and larger, and thus requires a longer access time and dissipates more power than a distributed RAM. Because distributed RAMs are based on Slices (SliceM), the distributed RAM-based implementations require much more logic resource (in terms of Slices) than BRAM-based implementations.

Table 6: Implementation results based on different RAM types
TCAM size (N × W)        | 1024 × 150 bits   | 2048 × 150 bits   | 4096 × 150 bits
RAM type                 | Distrib. | Block  | Distrib. | Block  | Distrib. | Block
Throughput (Mpps)        |          |        |          |        |          |
Slice utilization        | 6.72%    | 3.97%  | 13.8%    | 7.7%   | 26.4%    | 14.94%
BRAM utilization         | 0.0%     | 21.05% | 0.0%     | 42.1%  | 0.0%     | 84.2%
Estimated power (Watts)  |          |        |          |        |          |

4.5 Impact of Update Engine Layout
As discussed in Section 3.2.3, we can have flexible associations between lookup units and update engines by decoupling the update logic from each unit. We conduct experiments to evaluate the impact of different update engine (UE) layouts on the performance of the architecture. The evaluated update engine layouts include:

All: Each unit contains its own update logic.
Square: The four neighboring units forming a square share the same update engine (Figure 7(c)).

Row: The units in the same row share the same update engine (Figure 7(a)).
Column: The units in the same column share the same update engine (Figure 7(b)).
None: No update logic for any unit; the TCAM is not updatable.

In these experiments, N = 1024, W = 150, R = 4, L = 4, U = 64, H = 1, and w = 5. So the architecture consists of 4 by 4 units, essentially as illustrated in Figure 7. The implementation results are shown in Figure 11. Comparing the Slice results of the All and the None layouts, we can infer that the update logic accounts for more than half of the total logic usage of the architecture in the All layout. In the Square, Row, and Column layouts, sharing the update engine reduces the logic resource usage by roughly 25% compared with the All layout. These three layouts achieve similar logic resource savings because all of them have each update engine shared by four lookup units. The costs of sharing the update engine are a slightly degraded throughput and a slightly increased power consumption, basically due to the wide mux/demux and the stretched signal routing between lookup units and update engines. Higher throughput could be obtained by careful chip floor planning. Also note that the update engine layout has no effect on the memory requirement, which is determined only by the lookup units.

Figure 11: Impact of the update engine (UE) layout: throughput, memory, power, and Slice usage for the All, Square, Row, Column, and None layouts.

4.6 Cost of Explicit Range Matching
As discussed in Section 3.3, we provide the capability to add explicit range matching logic to the TCAM architecture so that range-to-ternary conversion can be avoided for some search applications such as ACL. Such explicit range matching logic is based on a heavy use of registers. We conduct experiments to understand the performance cost of the explicit range matching logic. We fix W = 150 and increase the number of 16-bit fields that are specified as ranges. The other parameters are the defaults: N = 1024, R = 4, L = 4, U = 64, H = 1, and w = 5 (distributed RAM), and each TCAM unit has its own update logic. Table 7 shows that adding the explicit range matching logic for each 16-bit range-based field requires about 5K more Slices and over 30K more registers. The increased usage of logic also results in higher power consumption. Whether to enable the explicit range matching should be decided based on the characteristics of the ruleset used in the search application. Consider a ruleset whose expansion ratio (due to range-to-ternary conversion) is a, while it requires b times more logic resource to add the explicit range matching logic. Then it is better not to enable the explicit range matching if a < b.

Table 7: Adding explicit range matching
# of range fields        | 0     | 1     | 2
Throughput (Mpps)        |       |       |
Slice utilization        | 6.72% | 8.56% | 10.26%
Register utilization     | 1.54% | 2.82% | 4.2%
Estimated power (Watts)  |       |       |
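For reference, the match semantics whose cost Table 7 quantifies can be sketched as follows. This is a software model with our own names; in hardware the ternary part is the RAM lookup and the bounds sit in registers:

```python
def rule_matches(key_bits, ternary, port_values, port_bounds):
    # Explicit range matching: the ternary part of the rule is matched as
    # before (modeled here by direct comparison), while each range field is
    # checked against its register-stored (lo, hi) bounds, so no range-to-
    # ternary conversion, and hence no rule expansion, is needed.
    ternary_ok = all(t in ('*', k) for k, t in zip(key_bits, ternary))
    ranges_ok = all(lo <= v <= hi
                    for v, (lo, hi) in zip(port_values, port_bounds))
    return ternary_ok and ranges_ok

# One ACL-like rule: a ternary part plus two 16-bit port ranges (our example).
print(rule_matches('1010', '10**',
                   port_values=(80, 443),
                   port_bounds=[(0, 1023), (443, 443)]))  # -> True
```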

5. RELATED WORK
Although various algorithmic solutions (including those using FPGAs) [4, 10] have been proposed as alternatives to TCAMs, their success so far is limited to a few particular applications such as exact matching and longest prefix matching. While they can efficiently exploit the characteristics of real-life data sets, these algorithmic solutions cannot provide the same deterministic performance (e.g., throughput, latency, storage requirement, etc.) as TCAMs when searching over an arbitrary set of ternary words.

Most existing FPGA-based TCAM designs are based on brute-force implementations which map the native TCAM architecture directly onto FPGA logic. A straightforward method is to use two bits of registers to encode one TCAM bit. But such a design cannot scale well due to the limited amount of registers, which are usually heavily used for various other purposes such as pipelining. For example, the target FPGA device (XC7V2000T) in our experiments contains 2.4 Mbits of registers, which could implement a TCAM of no larger than 1.2 Mbits; in reality, the TCAM that can be implemented using registers would be much smaller as a result of routing and timing challenges. Locke [17] proposes a more efficient design based on the 16-bit shift register (SRL16), where an SRL16 is used to build a 1 × 2 TCAM. Like the distributed RAM, an SRL16 is based on SliceM: a SliceM can be converted into either 8 SRL16s or a 32 × 6 distributed RAM in simple dual-port mode. The largest TCAM that could be implemented using SRL16s on our target device (XC7V2000T) would be 1.6 Mbits.

Recently, Ullah et al. [13] and Zerbini et al. [12] presented FPGA implementations of their RAM-based TCAMs. These designs share the same basic idea as our design of using the search key as the address to access RAMs. However, neither of them gives a theoretical analysis or a correctness proof for the construction of a TCAM using RAMs. Their architectures are monolithic and could be viewed as a single large one-stage TCAM unit in our modular architecture. When implementing a large TCAM, such monolithic architectures suffer from bitwise-ANDing many wide bit vectors and from priority encoding the deep match vector. Due to the lack of a thorough investigation of the optimal settings, their FPGA implementations are less efficient than our design. [13] implements a TCAM using more than 6 Mbits of BRAM on a Xilinx Virtex-5 FPGA; when the priority encoder is added, the clock rate of their implementation is merely 22 MHz. The TCAM designs of [12] are implemented on high-end Altera FPGAs with the fastest speed grade; even with these large-capacity FPGAs, their implementations can support a TCAM of no larger than 0.5 Mbits.

6. CONCLUSION
TCAMs are widely used in network infrastructure for various search functions. There has been growing interest in implementing TCAMs using reconfigurable hardware such as FPGA: such soft TCAMs are more flexible and easier to integrate than ASIC-based hard TCAMs. But existing FPGA-based TCAM designs can support only small TCAMs, mainly due to inefficient resource usage. This paper shares our efforts and experience in pushing the limit of implementing large TCAMs on a state-of-the-art FPGA. We formalize the ideas and the algorithms behind the RAM-based TCAM and analyze the performance thoroughly.
After identifying the key challenges, we propose a scalable and modular architecture with multiple optimizations. We evaluate our design comprehensively to understand the various performance trade-offs. The FPGA implementation results show that our design can support a large TCAM of 2.4 Mbits while sustaining a high throughput of 150 Mpps.

7. REFERENCES
[1] OpenFlow - enabling innovation in your network.
[2] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz. Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN. In SIGCOMM '13: Proceedings of the ACM SIGCOMM 2013 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 99-110, August 2013.
[3] D. E. Taylor. Survey and taxonomy of packet classification techniques. ACM Comput. Surv., 37(3):238-275, Sept. 2005.
[4] F. Baboescu, S. Singh, and G. Varghese. Packet classification for core routers: Is there an alternative to CAMs? In INFOCOM '03: Proceedings of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies, volume 1, pages 53-63, March/April 2003.
[5] Xilinx Virtex-7 FPGA Family.
[6] Altera Stratix V FPGAs.
[7] M. Becchi and P. Crowley. Efficient regular expression evaluation: theory to practice. In ANCS '08: Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, pages 50-59, 2008.
[8] M. Attig and G. J. Brebner. 400 Gb/s programmable packet parsing on a single FPGA. In ANCS '11: Proceedings of the 7th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, pages 12-23, 2011.
[9] G. Gogniat, T. Wolf, W. Burleson, J.-P. Diguet, L. Bossuet, and R. Vaslin. Reconfigurable hardware for high-security/high-performance embedded systems: the SAFES perspective. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(2):144-155, 2008.

[10] W. Jiang and V. K. Prasanna. Scalable packet classification on FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 20(9):1668-1680, 2012.
[11] W. Jiang and V. K. Prasanna. Field-split parallel architecture for high performance multi-match packet classification using FPGAs. In SPAA '09: Proceedings of the 21st Annual Symposium on Parallelism in Algorithms and Architectures, pages 188-196, 2009.
[12] C. A. Zerbini and J. M. Finochietto. Performance evaluation of packet classification on FPGA-based TCAM emulation architectures. In GLOBECOM '12: Proceedings of the IEEE Global Communications Conference, 2012.
[13] Z. Ullah, M. K. Jaiswal, Y. C. Chan, and R. C. C. Cheung. FPGA implementation of SRAM-based ternary content addressable memory. In IPDPSW '12: Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012.
[14] O. Rottenstreich, R. Cohen, D. Raz, and I. Keslassy. Exact worst-case TCAM rule expansion. IEEE Transactions on Computers, 62(6):1127-1140, 2013.
[15] E. Spitznagel, D. Taylor, and J. Turner. Packet classification using extended TCAMs. In ICNP '03: Proceedings of the 11th IEEE International Conference on Network Protocols, pages 120-131, 2003.
[16] C. E. LaForest and J. G. Steffan. Efficient multi-ported memories for FPGAs. In FPGA '10: Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 41-50, 2010.
[17] K. Locke. XAPP1151 - Parameterizable Content-Addressable Memory. application_notes/xapp1151_Param_CAM.pdf.


More information

CHAPTER 4 BLOOM FILTER

CHAPTER 4 BLOOM FILTER 54 CHAPTER 4 BLOOM FILTER 4.1 INTRODUCTION Bloom filter was formulated by Bloom (1970) and is used widely today for different purposes including web caching, intrusion detection, content based routing,

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

IP packet forwarding, or simply, IP-lookup, is a classic

IP packet forwarding, or simply, IP-lookup, is a classic Scalable Tree-based Architectures for IPv4/v6 Lookup Using Prefix Partitioning Hoang Le, Student Member, IEEE, and Viktor K. Prasanna, Fellow, IEEE Abstract Memory efficiency and dynamically updateable

More information

High-Performance Packet Classification on GPU

High-Performance Packet Classification on GPU High-Performance Packet Classification on GPU Shijie Zhou, Shreyas G. Singapura and Viktor. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, CA 99

More information

Jakub Cabal et al. CESNET

Jakub Cabal et al. CESNET CONFIGURABLE FPGA PACKET PARSER FOR TERABIT NETWORKS WITH GUARANTEED WIRE- SPEED THROUGHPUT Jakub Cabal et al. CESNET 2018/02/27 FPGA, Monterey, USA Packet parsing INTRODUCTION It is among basic operations

More information

Multi-dimensional Packet Classification on FPGA: 100 Gbps and Beyond

Multi-dimensional Packet Classification on FPGA: 100 Gbps and Beyond Multi-dimensional Packet Classification on FPGA: 00 Gbps and Beyond Yaxuan Qi, Jeffrey Fong 2, Weirong Jiang 3, Bo Xu 4, Jun Li 5, Viktor Prasanna 6, 2, 4, 5 Research Institute of Information Technology

More information

Lecture 11: Packet forwarding

Lecture 11: Packet forwarding Lecture 11: Packet forwarding Anirudh Sivaraman 2017/10/23 This week we ll talk about the data plane. Recall that the routing layer broadly consists of two parts: (1) the control plane that computes routes

More information

Field-Split Parallel Architecture for High Performance Multi-Match Packet Classification Using FPGAs

Field-Split Parallel Architecture for High Performance Multi-Match Packet Classification Using FPGAs Field-Split Parallel Architecture for High Performance Multi-Match Packet Classification Using FPGAs Weirong Jiang Ming Hsieh Department of Electrical Engineering University of Southern California Los

More information

A Fast Ternary CAM Design for IP Networking Applications

A Fast Ternary CAM Design for IP Networking Applications A Fast Ternary CAM Design for IP Networking Applications Bruce Gamache (gamacheb@coloradoedu) Zachary Pfeffer (pfefferz@coloradoedu) Sunil P Khatri (spkhatri@coloradoedu) Department of Electrical and Computer

More information

Problem Statement. Algorithm MinDPQ (contd.) Algorithm MinDPQ. Summary of Algorithm MinDPQ. Algorithm MinDPQ: Experimental Results.

Problem Statement. Algorithm MinDPQ (contd.) Algorithm MinDPQ. Summary of Algorithm MinDPQ. Algorithm MinDPQ: Experimental Results. Algorithms for Routing Lookups and Packet Classification October 3, 2000 High Level Outline Part I. Routing Lookups - Two lookup algorithms Part II. Packet Classification - One classification algorithm

More information

Scalable Multi-Pipeline Architecture for High Performance Multi-Pattern String Matching

Scalable Multi-Pipeline Architecture for High Performance Multi-Pattern String Matching Scalable Multi-Pipeline Architecture for High Performance Multi-Pattern String Matching Weirong Jiang, Yi-Hua E. Yang and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of

More information

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices 3 Digital Systems Implementation Programmable Logic Devices Basic FPGA Architectures Why Programmable Logic Devices (PLDs)? Low cost, low risk way of implementing digital circuits as application specific

More information

High-Performance Network Data-Packet Classification Using Embedded Content-Addressable Memory

High-Performance Network Data-Packet Classification Using Embedded Content-Addressable Memory High-Performance Network Data-Packet Classification Using Embedded Content-Addressable Memory Embedding a TCAM block along with the rest of the system in a single device should overcome the disadvantages

More information

Copyright 2011 Society of Photo-Optical Instrumentation Engineers. This paper was published in Proceedings of SPIE (Proc. SPIE Vol.

Copyright 2011 Society of Photo-Optical Instrumentation Engineers. This paper was published in Proceedings of SPIE (Proc. SPIE Vol. Copyright 2011 Society of Photo-Optical Instrumentation Engineers. This paper was published in Proceedings of SPIE (Proc. SPIE Vol. 8008, 80080E, DOI: http://dx.doi.org/10.1117/12.905281 ) and is made

More information

Scalable Packet Classification on FPGA

Scalable Packet Classification on FPGA Scalable Packet Classification on FPGA 1 Deepak K. Thakkar, 2 Dr. B. S. Agarkar 1 Student, 2 Professor 1 Electronics and Telecommunication Engineering, 1 Sanjivani college of Engineering, Kopargaon, India.

More information

FPGA Matrix Multiplier

FPGA Matrix Multiplier FPGA Matrix Multiplier In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri

More information

SCALABLE HIGH-THROUGHPUT SRAM-BASED ARCHITECTURE FOR IP-LOOKUP USING FPGA. Hoang Le, Weirong Jiang, Viktor K. Prasanna

SCALABLE HIGH-THROUGHPUT SRAM-BASED ARCHITECTURE FOR IP-LOOKUP USING FPGA. Hoang Le, Weirong Jiang, Viktor K. Prasanna SCALABLE HIGH-THROUGHPUT SRAM-BASED ARCHITECTURE FOR IP-LOOKUP USING FPGA Hoang Le, Weirong Jiang, Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Los

More information

Index Terms- Field Programmable Gate Array, Content Addressable memory, Intrusion Detection system.

Index Terms- Field Programmable Gate Array, Content Addressable memory, Intrusion Detection system. Dynamic Based Reconfigurable Content Addressable Memory for FastString Matching N.Manonmani 1, K.Suman 2, C.Udhayakumar 3 Dept of ECE, Sri Eshwar College of Engineering, Kinathukadavu, Coimbatore, India1

More information

Bringing Programmability to the Data Plane: Packet Processing with a NoC-Enhanced FPGA

Bringing Programmability to the Data Plane: Packet Processing with a NoC-Enhanced FPGA Total Tranceiver BW (Gb/s) Bringing Programmability to the Data Plane: Packet Processing with a NoC-Enhanced FPGA Andrew Bitar, Mohamed S. Abdelfattah, Vaughn Betz Department of Electrical and Computer

More information

Homework 1 Solutions:

Homework 1 Solutions: Homework 1 Solutions: If we expand the square in the statistic, we get three terms that have to be summed for each i: (ExpectedFrequency[i]), (2ObservedFrequency[i]) and (ObservedFrequency[i])2 / Expected

More information

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures Steven J.E. Wilton Department of Electrical and Computer Engineering University of British Columbia Vancouver, BC, Canada, V6T

More information

EITF35: Introduction to Structured VLSI Design

EITF35: Introduction to Structured VLSI Design EITF35: Introduction to Structured VLSI Design Introduction to FPGA design Rakesh Gangarajaiah Rakesh.gangarajaiah@eit.lth.se Slides from Chenxin Zhang and Steffan Malkowsky WWW.FPGA What is FPGA? Field

More information

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER International Journal of Computer Engineering and Applications, Volume VI, Issue II, May 14 www.ijcea.com ISSN 2321 3469 DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER Kiran K C 1, Sunil T D

More information

Automatic compilation framework for Bloom filter based intrusion detection

Automatic compilation framework for Bloom filter based intrusion detection Automatic compilation framework for Bloom filter based intrusion detection Dinesh C Suresh, Zhi Guo*, Betul Buyukkurt and Walid A. Najjar Department of Computer Science and Engineering *Department of Electrical

More information

Multicycle-Path Challenges in Multi-Synchronous Systems

Multicycle-Path Challenges in Multi-Synchronous Systems Multicycle-Path Challenges in Multi-Synchronous Systems G. Engel 1, J. Ziebold 1, J. Cox 2, T. Chaney 2, M. Burke 2, and Mike Gulotta 3 1 Department of Electrical and Computer Engineering, IC Design Research

More information

Data Structures for Packet Classification

Data Structures for Packet Classification Presenter: Patrick Nicholson Department of Computer Science CS840 Topics in Data Structures Outline 1 The Problem 2 Hardware Solutions 3 Data Structures: 1D 4 Trie-Based Solutions Packet Classification

More information

FPGA Based Packet Classification Using Multi-Pipeline Architecture

FPGA Based Packet Classification Using Multi-Pipeline Architecture International Journal of Wireless Communications and Mobile Computing 2015; 3(3): 27-32 Published online May 8, 2015 (http://www.sciencepublishinggroup.com/j/wcmc) doi: 10.11648/j.wcmc.20150303.11 ISSN:

More information

Pak. J. Biotechnol. Vol. 14 (Special Issue II) Pp (2017) Keerthiga D.S. and S. Bhavani

Pak. J. Biotechnol. Vol. 14 (Special Issue II) Pp (2017) Keerthiga D.S. and S. Bhavani DESIGN AND TESTABILITY OF Z-TERNARY CONTENT ADDRESSABLE MEMORY LOGIC Keerthiga Devi S. 1, Bhavani, S. 2 Department of ECE, FOE-CB, Karpagam Academy of Higher Education (Deemed to be University), Coimbatore,

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

Fault Grading FPGA Interconnect Test Configurations

Fault Grading FPGA Interconnect Test Configurations * Fault Grading FPGA Interconnect Test Configurations Mehdi Baradaran Tahoori Subhasish Mitra* Shahin Toutounchi Edward J. McCluskey Center for Reliable Computing Stanford University http://crc.stanford.edu

More information

P4 for an FPGA target

P4 for an FPGA target P4 for an FPGA target Gordon Brebner Xilinx Labs San José, USA P4 Workshop, Stanford University, 4 June 2015 What this talk is about FPGAs and packet processing languages Xilinx SDNet data plane builder

More information

Cisco Nexus 9508 Switch Power and Performance

Cisco Nexus 9508 Switch Power and Performance White Paper Cisco Nexus 9508 Switch Power and Performance The Cisco Nexus 9508 brings together data center switching power efficiency and forwarding performance in a high-density 40 Gigabit Ethernet form

More information

Online algorithms for clustering problems

Online algorithms for clustering problems University of Szeged Department of Computer Algorithms and Artificial Intelligence Online algorithms for clustering problems Summary of the Ph.D. thesis by Gabriella Divéki Supervisor Dr. Csanád Imreh

More information

ISSN Vol.05,Issue.09, September-2017, Pages:

ISSN Vol.05,Issue.09, September-2017, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,

More information

Advanced FPGA Design Methodologies with Xilinx Vivado

Advanced FPGA Design Methodologies with Xilinx Vivado Advanced FPGA Design Methodologies with Xilinx Vivado Alexander Jäger Computer Architecture Group Heidelberg University, Germany Abstract With shrinking feature sizes in the ASIC manufacturing technology,

More information

Content Addressable Memory (CAM) Implementation and Power Analysis on FPGA. Teng Hu. B.Eng., Southwest Jiaotong University, 2008

Content Addressable Memory (CAM) Implementation and Power Analysis on FPGA. Teng Hu. B.Eng., Southwest Jiaotong University, 2008 Content Addressable Memory (CAM) Implementation and Power Analysis on FPGA by Teng Hu B.Eng., Southwest Jiaotong University, 2008 A Report Submitted in Partial Fulfillment of the Requirements for the Degree

More information

CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger. Presented by: Ning Chen

CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger. Presented by: Ning Chen CSE 548 Computer Architecture Clock Rate vs IPC V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger Presented by: Ning Chen Transistor Changes Development of silicon fabrication technology caused transistor

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent

More information

A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms

A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms Jingzhao Ou and Viktor K. Prasanna Department of Electrical Engineering, University of Southern California Los Angeles, California,

More information

IP Forwarding. CSU CS557, Spring 2018 Instructor: Lorenzo De Carli

IP Forwarding. CSU CS557, Spring 2018 Instructor: Lorenzo De Carli IP Forwarding CSU CS557, Spring 2018 Instructor: Lorenzo De Carli 1 Sources George Varghese, Network Algorithmics, Morgan Kauffmann, December 2004 L. De Carli, Y. Pan, A. Kumar, C. Estan, K. Sankaralingam,

More information

EECS150 - Digital Design Lecture 16 - Memory

EECS150 - Digital Design Lecture 16 - Memory EECS150 - Digital Design Lecture 16 - Memory October 17, 2002 John Wawrzynek Fall 2002 EECS150 - Lec16-mem1 Page 1 Memory Basics Uses: data & program storage general purpose registers buffering table lookups

More information

CS419: Computer Networks. Lecture 6: March 7, 2005 Fast Address Lookup:

CS419: Computer Networks. Lecture 6: March 7, 2005 Fast Address Lookup: : Computer Networks Lecture 6: March 7, 2005 Fast Address Lookup: Forwarding/Routing Revisited Best-match Longest-prefix forwarding table lookup We looked at the semantics of bestmatch longest-prefix address

More information

LONGEST prefix matching (LPM) techniques have received

LONGEST prefix matching (LPM) techniques have received IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 14, NO. 2, APRIL 2006 397 Longest Prefix Matching Using Bloom Filters Sarang Dharmapurikar, Praveen Krishnamurthy, and David E. Taylor, Member, IEEE Abstract We

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

Efficient Multi-Match Packet Classification with TCAM

Efficient Multi-Match Packet Classification with TCAM Efficient Multi-Match Packet Classification with Fang Yu and Randy Katz fyu, randy @eecs.berkeley.edu CS Division, EECS Department, U.C.Berkeley Report No. UCB/CSD-4-1316 March 2004 Computer Science Division

More information

Area Efficient Multi-Ported Memories with Write Conflict Resolution

Area Efficient Multi-Ported Memories with Write Conflict Resolution Area Efficient Multi-Ported Memories with Write Conflict Resolution A thesis submitted to the Graduate School of University of Cincinnati in partial fulfillment of the requirements for the degree of Master

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

Dynamic Pipelining: Making IP- Lookup Truly Scalable

Dynamic Pipelining: Making IP- Lookup Truly Scalable Dynamic Pipelining: Making IP- Lookup Truly Scalable Jahangir Hasan T. N. Vijaykumar School of Electrical and Computer Engineering, Purdue University SIGCOMM 05 Rung-Bo-Su 10/26/05 1 0.Abstract IP-lookup

More information

AN EFFICIENT HYBRID ALGORITHM FOR MULTIDIMENSIONAL PACKET CLASSIFICATION

AN EFFICIENT HYBRID ALGORITHM FOR MULTIDIMENSIONAL PACKET CLASSIFICATION AN EFFICIENT HYBRID ALGORITHM FOR MULTIDIMENSIONAL PACKET CLASSIFICATION Yaxuan Qi 1 and Jun Li 1,2 1 Research Institute of Information Technology (RIIT), Tsinghua University, Beijing, China, 100084 2

More information

5. ReAl Systems on Silicon

5. ReAl Systems on Silicon THE REAL COMPUTER ARCHITECTURE PRELIMINARY DESCRIPTION 69 5. ReAl Systems on Silicon Programmable and application-specific integrated circuits This chapter illustrates how resource arrays can be incorporated

More information

A Multi Gigabit FPGA-based 5-tuple classification system

A Multi Gigabit FPGA-based 5-tuple classification system A Multi Gigabit FPGA-based 5-tuple classification system Antonis Nikitakis Technical University of Crete, Department of Electronic and Computer Engineering Kounoupidiana, Chania, Crete, GR73100, Greece

More information

Overview. Implementing Gigabit Routers with NetFPGA. Basic Architectural Components of an IP Router. Per-packet processing in an IP Router

Overview. Implementing Gigabit Routers with NetFPGA. Basic Architectural Components of an IP Router. Per-packet processing in an IP Router Overview Implementing Gigabit Routers with NetFPGA Prof. Sasu Tarkoma The NetFPGA is a low-cost platform for teaching networking hardware and router design, and a tool for networking researchers. The NetFPGA

More information

A Configurable Packet Classification Architecture for Software- Defined Networking

A Configurable Packet Classification Architecture for Software- Defined Networking A Configurable Packet Classification Architecture for Software- Defined Networking Guerra Pérez, K., Yang, X., Scott-Hayward, S., & Sezer, S. (2014). A Configurable Packet Classification Architecture for

More information

Parallel FIR Filters. Chapter 5

Parallel FIR Filters. Chapter 5 Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture

More information

High Throughput Sketch Based Online Heavy Change Detection on FPGA

High Throughput Sketch Based Online Heavy Change Detection on FPGA High Throughput Sketch Based Online Heavy Change Detection on FPGA Da Tong, Viktor Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, CA 90089, USA.

More information

Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs

Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs Xin Fang and Miriam Leeser Dept of Electrical and Computer Eng Northeastern University Boston, Massachusetts 02115

More information

Optimizing Packet Lookup in Time and Space on FPGA

Optimizing Packet Lookup in Time and Space on FPGA Optimizing Packet Lookup in Time and Space on FPGA Thilan Ganegedara, Viktor Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089 Email: {ganegeda,

More information

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore

More information

Ultra-Fast NoC Emulation on a Single FPGA

Ultra-Fast NoC Emulation on a Single FPGA The 25 th International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015 Ultra-Fast NoC Emulation on a Single FPGA Thiem Van Chu, Shimpei Sato, and Kenji Kise Tokyo

More information

Routing Lookup Algorithm for IPv6 using Hash Tables

Routing Lookup Algorithm for IPv6 using Hash Tables Routing Lookup Algorithm for IPv6 using Hash Tables Peter Korppoey, John Smith, Department of Electronics Engineering, New Mexico State University-Main Campus Abstract: After analyzing of existing routing

More information

Introduction to Field Programmable Gate Arrays

Introduction to Field Programmable Gate Arrays Introduction to Field Programmable Gate Arrays Lecture 1/3 CERN Accelerator School on Digital Signal Processing Sigtuna, Sweden, 31 May 9 June 2007 Javier Serrano, CERN AB-CO-HT Outline Historical introduction.

More information