
Scalable Ternary Content Addressable Memory Implementation Using FPGAs

Weirong Jiang
Xilinx Research Labs
San Jose, CA, USA

ABSTRACT
Ternary Content Addressable Memory (TCAM) is widely used in network infrastructure for various search functions. There has been a growing interest in implementing TCAM using reconfigurable hardware such as Field Programmable Gate Array (FPGA). Most existing FPGA-based TCAM designs are based on brute-force implementations, which result in inefficient on-chip resource usage. As a result, existing designs support only a small TCAM size even with large FPGA devices. They also suffer from significant throughput degradation when implementing a large TCAM, mainly caused by deep priority encoding. This paper presents a scalable random access memory (RAM)-based TCAM architecture aimed at efficient implementation on state-of-the-art FPGAs. We give a formal study of the RAM-based TCAM to unveil the ideas and the algorithms behind it. To conquer the timing challenge, we propose a modular architecture consisting of arrays of small-size RAM-based TCAM units. After decoupling the update logic from each unit, the modular architecture allows us to share each update engine among multiple units, which leads to resource savings. The capability of explicit range matching is also offered, to avoid range-to-ternary conversion for search functions that require range matching. Implementation on a Xilinx Virtex-7 FPGA shows that our design can support a large TCAM of up to 2.4 Mbits while sustaining a high throughput of 150 million packets per second. The resource usage scales linearly with the TCAM size. The architecture is configurable, allowing various performance trade-offs to be exploited. To the best of our knowledge, this is the first FPGA design that implements a TCAM larger than 1 Mbit.

Categories and Subject Descriptors
C.1.4 [Processor Architectures]: Parallel Architectures; C.2.6 [Computer Communication Networks]: Internetworking

General Terms
Algorithms, Design, Performance

Keywords
FPGA; RAM; TCAM

1. INTRODUCTION
Ternary Content Addressable Memory (TCAM) is a specialized associative memory where each bit can be 0, 1, or "don't care" (i.e., *). TCAM has been widely used in network infrastructure for various search functions including longest prefix matching (LPM), multi-field packet classification, etc. For each input key, a TCAM performs a parallel search over all stored words and finds the matching word(s) in a single clock cycle. A priority encoder is needed to obtain the index of the matching word with the highest priority. In a TCAM, the physical location normally determines the priority; e.g., the top word has the highest priority. Most current TCAMs are implemented as standalone application-specific integrated circuits (ASICs). We call them native TCAMs. Native TCAMs are expensive, power-hungry, and not scalable with respect to clock rate or circuit area, especially compared with Random Access Memories (RAMs). The limited configurability of native TCAMs does not fit the requirements of some network applications (e.g., OpenFlow [1]) where the width and/or the depth of different lookup tables can be variable [2]. Various algorithmic solutions have been proposed as alternatives to native TCAMs, but none of them is exactly equivalent to a TCAM. The success of algorithmic solutions is limited to a few specific search functions such as exact matching and LPM.
For some other search functions such as multi-field packet classification, the algorithmic solutions [3, 4] employ various heuristics, leading to non-deterministic performance that is often dependent on the characteristics of the data set. On the other hand, reconfigurable hardware such as the field-programmable gate array (FPGA) combines the flexibility of software with near-ASIC performance. State-of-the-art FPGA devices such as Xilinx Virtex-7 [5] and Altera Stratix-V [6] provide high clock rates, low power dissipation, rich on-chip resources and large amounts of embedded memory with configurable word width. Due to their increasing capacity, modern FPGAs have become an attractive option for implementing various networking functions [7, 8, 9, 10]. Compared with ASIC, FPGA technology is increasingly favorable because of its shorter time to market, lower development cost and the shrinking performance gap between FPGA and ASIC. Due to the demand for TCAMs that are flexible to configure and easy to integrate, there has been a growing interest in employing FPGAs to implement TCAMs or TCAM-equivalent search engines. While several FPGA-based TCAM designs exist, most of them are brute-force implementations that mimic the native TCAM architecture. Their resource usage is inefficient, which makes them less interesting in practice. On the other hand, some recent work [11, 12, 13] shows that RAMs can be employed to emulate/implement a TCAM.

But none of that work gives a correctness proof or a thorough study of efficient FPGA implementation. Their architectures are monolithic and do not scale well when implementing large TCAMs. A goal of this paper is to advance FPGA-based TCAM design by investigating both the theory and the architecture of RAM-based TCAM implementation. The main contributions include:

- We give an in-depth introduction to the RAM-based TCAM. We formalize the key ideas and the algorithms behind it.
- We analyze thoroughly the theoretical performance of the RAM-based TCAM and identify the key challenges in implementing a large RAM-based TCAM.
- We propose a modular and scalable architecture that consists of arrays of small-size RAM-based TCAM units. By decoupling the update logic from each unit, such a modular architecture enables each update engine to be shared among multiple units, saving logic resources.
- We share our experience in implementing the proposed architecture on a state-of-the-art FPGA. The post place and route results show that our design can support a large TCAM of up to 2.4 Mbits while sustaining a high throughput of 150 million packets per second (Mpps). To the best of our knowledge this is the first FPGA design that implements a TCAM larger than 1 Mbit.
- We conduct comprehensive experiments to characterize the various performance trade-offs offered by the configurable architecture. We also discuss the support of range matching without range-to-ternary conversion.

The rest of the paper is organized as follows. Section 2 gives a detailed introduction to the theoretical aspects of the RAM-based TCAM. Section 3 discusses the hardware architectures for scalable RAM-based TCAM. Section 4 presents comprehensive evaluation results based on the implementation on a state-of-the-art FPGA. Section 5 reviews the related work on FPGA-based TCAM designs. Section 6 concludes the paper.

2. RAM-BASED TCAM

2.1 Terminology
We first have the following definitions:

- The depth of a TCAM (or RAM) is the number of words in the TCAM (or RAM), denoted N.
- The width of a TCAM (or RAM) is the number of bits of each TCAM (or RAM) word, denoted W.
- The size of a TCAM (or RAM) is the total number of bits of the TCAM (or RAM); it equals N × W.
- The address width of a RAM is the number of bits of the RAM address, denoted d. Note that N = 2^d for a RAM.

We describe the organization of a TCAM or RAM as Depth × Width, i.e., N × W. For example, a 2 × 1 RAM consists of 2 words where each word is 1 bit. We call a TCAM or RAM wide (or narrow) if its width is large (or small), and deep (or shallow) if its depth is large (or small). We also use the notation shown in Table 1.

Table 1: Notation
Notation | Description
k        | An input key, i.e., a binary number
t        | A ternary word
A        | An alphabet of 1-bit characters
A^n      | The set of all n-bit strings over A
|s|      | The length of a string s in A^n, i.e., |s| = n

2.2 Main Ideas
A TCAM can be divided into two logical areas: (1) the TCAM words and (2) the priority encoder. Each TCAM word consists of a row of match cells attached to the same match line. During lookup, each input key goes to all the N words in parallel and retrieves an N-bit match vector. The i-th bit of the match vector indicates whether the key matches the i-th word, i = 1, 2, ..., N. In this section, for ease of discussion, we consider a TCAM without the priority encoder. Thus the output of the considered TCAM is an N-bit match vector instead of the index of the matching word with the highest priority.
Looking up an N × W TCAM is basically mapping a W-bit binary input key to an N-bit binary match vector. The same mapping can be achieved by using a 2^W × N RAM, where the W-bit input key is used as the address to access the RAM and each RAM word stores an N-bit vector. Figure 1(a) shows a 1 × 1 TCAM and its corresponding RAM-based implementation. As the TCAM word stores a don't care bit, the match vector is always 1, no matter whether the input 1-bit key is 0 or 1.

2.2.1 Depth Extension
The depth of a native TCAM is increased by vertically stacking words of the same width. Correspondingly, in the RAM-based implementation, the depth of a TCAM is extended by increasing the width of the RAM: each column of the RAM represents the match vector for one word. Figure 1(b) shows a 2 × 1 TCAM which adds a word to the TCAM shown in Figure 1(a); correspondingly, the RAM-based implementation adds a column to the RAM shown in Figure 1(a). We see that the memory requirement of either the native TCAM or its RAM-based implementation is linear in the depth. We can also view the depth extension as concatenating the match vectors from multiple shallower TCAMs. For instance, an N × W TCAM can be horizontally divided into two TCAMs, one N_1 × W and the other N_2 × W, where N = N_1 + N_2. Then there are two RAMs in the corresponding RAM-based implementation, one 2^W × N_1 and the other 2^W × N_2. The outputs of the two RAMs are concatenated to obtain the final N-bit match vector. This is essentially equivalent to building a wider RAM by concatenating two RAMs with the same depth. For the sake of simplicity, we consider a wide RAM built by concatenating multiple RAMs in this way as a single RAM.

Figure 1: (a) Matching a 1-bit key with a 1 × 1 TCAM; (b) matching a 1-bit key with a 2 × 1 TCAM; (c) matching a 2-bit key with a 1 × 2 TCAM.

Figure 2: Building a 1 × 2 TCAM using two 1 × 1 TCAMs.

Table 2: Representing a ternary bit t in a 2 × 1 RAM
The value of t  | RAM[0] | RAM[1]
0               | 1      | 0
1               | 0      | 1
don't care (*)  | 1      | 1

2.2.2 Width Extension
A wider TCAM deals with a wider input key. When implementing the TCAM in a single RAM, a wider input key (which is used as the address to access the RAM) implies a wider address for the RAM. This results in a deeper RAM whose depth is 2^W. Figure 1(c) shows a 1 × 2 TCAM which extends the width of the TCAM shown in Figure 1(a): as the width of the input key is increased by 1 bit, the depth of the RAM in the corresponding RAM-based TCAM is doubled. Such a design cannot scale well for wide input keys. An alternative solution is to use multiple narrow TCAMs to implement a wide TCAM. For example, an N × W TCAM can be vertically divided into two TCAMs, one N × W_1 and the other N × W_2, where W = W_1 + W_2. During lookup, a W-bit input key is divided into two segments accordingly, one of W_1 bits and the other of W_2 bits. Each of the two narrower TCAMs matches the corresponding segment of the key and outputs an N-bit match vector. The two match vectors are then bitwise ANDed to obtain the final match vector. The two narrow TCAMs map to two shallow RAMs in the corresponding RAM-based implementation. The memory requirement per TCAM word becomes 2^{W_1} + 2^{W_2} bits instead of 2^W = 2^{W_1} × 2^{W_2} bits. Figure 2 shows how a 1 × 2 TCAM is built based on two 1 × 1 TCAMs.

2.2.3 Populating the RAM
Given a set of ternary words, we need to populate the RAMs so that the RAM-based implementation fulfills the same search function as the native TCAM. As shown in Figure 1(a), it is easy to populate the RAM for the RAM-based implementation of a 1 × 1 TCAM. Table 2 shows the content of the 2 × 1 RAM populated for the 1 × 1 TCAM, where RAM[k] denotes the RAM word at address k, k = 0, 1. Principle 1 states the rule for populating the 2 × 1 RAM to represent a 1 × 1 TCAM, where k ∈ {0, 1} and t ∈ {0, 1, *}.

Principle 1. RAM[k] = 1 if and only if k matches t; otherwise RAM[k] = 0.

Theorem 1. The RAM populated following Principle 1 achieves a function equivalent to the TCAM that stores t.

Proof. In the TCAM, the output for an input k is 1 if k matches t; otherwise the output is 0. In the populated RAM, the output for an input k is 1 if RAM[k] = 1, which by Principle 1 holds exactly when k matches t; otherwise the output is 0. Thus the populated RAM is equivalent to the represented TCAM.

Both Principle 1 and Theorem 1 are directly applicable to the case of a 1 × W TCAM where k ∈ A^W, A = {0, 1}, and t is a W-bit ternary word over {0, 1, *}. Principle 1 can also be extended to the case of an N × W TCAM implemented in a 2^W × N RAM. Let RAM[k][i] denote the i-th bit of the k-th word in the RAM, and let t_i denote the i-th word in the TCAM, i = 1, 2, ..., N. We then have Principle 2 for populating the 2^W × N RAM to represent an N × W TCAM.

Principle 2. RAM[k][i] = 1 if and only if k matches t_i; otherwise RAM[k][i] = 0, for i = 1, 2, ..., N.

When a wide TCAM is built using multiple narrower TCAMs, the RAM corresponding to each narrow TCAM is populated individually by following Principle 2.
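Principles 1 and 2 translate directly into code. The following Python sketch is a minimal software model of the population step (the helper names and the example words are ours, not the paper's):

```python
def matches(key_bits, ternary_word):
    # A key matches a ternary word iff every ternary bit is '*' or equals the key bit.
    return all(t in ('*', k) for k, t in zip(key_bits, ternary_word))

def populate_ram(ternary_words, W):
    # Principle 2: RAM[k][i] = 1 iff the W-bit key k matches the i-th ternary word t_i.
    N = len(ternary_words)
    ram = [[0] * N for _ in range(2 ** W)]
    for k in range(2 ** W):
        key_bits = format(k, '0%db' % W)
        for i, t in enumerate(ternary_words):
            ram[k][i] = 1 if matches(key_bits, t) else 0
    return ram

# A 4 x 3 TCAM emulated by a 2^3 x 4 RAM (the stored words are our own example).
ram = populate_ram(['10*', '1*1', '***', '010'], W=3)
print(ram[0b101])  # match vector for key 101 -> [1, 1, 1, 0]
```

A lookup is then a single indexing operation into the populated table, which is exactly the O(1) behavior formalized next.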

2.3 Algorithms and Analysis
This section formalizes the algorithms for using a RAM-based TCAM built according to the discussion in Section 2.2.

2.3.1 General Model
Based on the discussion in Section 2.2.2, an N × W TCAM can be constructed using P narrow TCAMs, P = 1, 2, ..., W. The size of the i-th TCAM is N × W_i, i = 1, 2, ..., P, and W = sum_{i=1}^{P} W_i. Let RAM_i denote the RAM corresponding to the i-th narrow TCAM; the size of RAM_i is 2^{W_i} × N. Hence the N × W TCAM can be implemented using these P RAMs.

2.3.2 Lookup
Algorithm 1 shows how to search a key over the RAM-based TCAM. It takes O(1) time to access each RAM. Since the P RAMs are accessed in parallel, the overall time complexity for lookup is O(1).

Algorithm 1 Lookup
Input: A W-bit key k.
Input: {RAM_i}, i = 1, 2, ..., P.
Output: An N-bit match vector m.
1: Divide k into P segments: k -> {k_1, k_2, ..., k_P}, |k_i| = W_i, i = 1, 2, ..., P.
2: Initialize m to be all 1s: m <- 11...1.
3: for i <- 1 to P do  {bitwise AND}
4:   m <- m & RAM_i[k_i]
5: end for

2.3.3 Update
Updating a TCAM means either adding or deleting a specific TCAM word. Algorithm 2 shows how to add or delete the n-th word of the TCAM in the RAM-based implementation, n = 1, 2, ..., N. It takes O(2^{W_i}) time to update RAM_i, i = 1, 2, ..., P. As the P RAMs are updated in parallel, the overall time complexity for an update is determined by the RAM that takes the longest time, which is O(max_{i=1..P} 2^{W_i}) = O(2^{max_i W_i}).

Algorithm 2 Updating a TCAM word
Input: A W-bit ternary word t.
Input: The index of t: n.
Input: The update operation: op ∈ {add, delete}.
Output: Updated {RAM_i}, i = 1, 2, ..., P.
1: Divide t into P segments: t -> {t_1, t_2, ..., t_P}, |t_i| = W_i, i = 1, 2, ..., P.
2: for i <- 1 to P do  {update each RAM}
3:   for k <- 0 to 2^{W_i} - 1 do
4:     if k matches t_i and op == add then
5:       RAM_i[k][n] <- 1
6:     else
7:       RAM_i[k][n] <- 0
8:     end if
9:   end for
10: end for

2.3.4 Space Analysis
The size of RAM_i is 2^{W_i} × N, i = 1, 2, ..., P. Hence the overall memory requirement is sum_{i=1}^{P} (2^{W_i} × N) = N sum_{i=1}^{P} 2^{W_i}. To minimize the overall memory requirement, we formulate the problem as:

  min_{P} min_{W_1, W_2, ..., W_P} sum_{i=1}^{P} 2^{W_i}    (1)
  subject to sum_{i=1}^{P} W_i = W    (2)

For a given P, min_{W_1, ..., W_P} sum_{i=1}^{P} 2^{W_i} = P × 2^{W/P}, achieved when W_i = W/P for all i = 1, 2, ..., P. Hence the overall memory requirement is minimized when all the P RAMs have the same address width, denoted w = W/P. The depth of each RAM is then 2^w, and the overall memory requirement is

  sum_{i=1}^{W/w} (2^w × N) = (W/w) × 2^w × N = N × W × 2^w / w    (3)

We define the RAM/TCAM ratio as the number of RAM bits needed to implement one TCAM bit. According to Equation (3), the RAM/TCAM ratio is 2^w / w when all the RAMs employ the same address width w. Basically, a larger w results in a larger RAM/TCAM ratio, which indicates lower memory efficiency. The minimum RAM/TCAM ratio is 2, achieved when w = 1 (P = W) or w = 2 (P = W/2). In other words, when the depth of each RAM is 2 (w = 1) or 4 (w = 2), the overall memory requirement achieves its minimum of 2NW, i.e., twice the size of the corresponding native TCAM.

2.3.5 Comparison with Native TCAM
Table 3 summarizes the differences between the native TCAM and its corresponding implementation using P RAMs, with respect to time and space complexities. Here we consider all the RAMs to employ the same address width (w), so that both the update time and the space achieve the optimum for the RAM-based TCAM (as discussed in Sections 2.3.3 and 2.3.4).

Table 3: Native TCAM vs. RAM-based TCAM
            | Native TCAM | RAM-based TCAM
Lookup time | O(1)        | O(1)
Update time | O(1)        | O(2^w)
Space       | N × W       | (2^w / w) × N × W
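As a concrete companion to Algorithms 1 and 2, the sketch below models the P-RAM construction in Python, with each match vector stored as an N-bit integer. It is a software model under our own naming, not the hardware design itself:

```python
W_SEGS = [2, 2]   # widths W_i of the P narrow TCAMs; W = sum(W_SEGS) = 4
N = 4             # TCAM depth; a match vector is modeled as an N-bit int

rams = [[0] * (2 ** w_i) for w_i in W_SEGS]   # RAM_i: 2^{W_i} words of N bits

def split(bits):
    # Divide a W-bit string into P segments of widths W_SEGS[0..P-1].
    out, pos = [], 0
    for w_i in W_SEGS:
        out.append(bits[pos:pos + w_i])
        pos += w_i
    return out

def seg_matches(k_bits, t_bits):
    return all(t in ('*', k) for k, t in zip(k_bits, t_bits))

def update(t, n, op):
    # Algorithm 2: add or delete the n-th ternary word t; O(2^{max W_i}) time.
    for ram, t_i, w_i in zip(rams, split(t), W_SEGS):
        for k in range(2 ** w_i):
            if seg_matches(format(k, '0%db' % w_i), t_i) and op == 'add':
                ram[k] |= 1 << n     # set bit n of RAM_i[k]
            else:
                ram[k] &= ~(1 << n)  # clear bit n of RAM_i[k]

def lookup(key_bits):
    # Algorithm 1: bitwise-AND the P match vectors; O(1) parallel RAM accesses.
    m = (1 << N) - 1
    for ram, k_i in zip(rams, split(key_bits)):
        m &= ram[int(k_i, 2)]
    return m

update('10**', n=0, op='add')
update('0101', n=1, op='add')
print(format(lookup('1011'), '04b'))  # -> 0001: only word 0 ('10**') matches
```

With W_SEGS = [2, 2] each RAM has w = 2, so the model uses 2 × 2^2 × N = 32 RAM bits for a 16-bit TCAM, matching the minimum RAM/TCAM ratio of 2 derived above.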
3. HARDWARE ARCHITECTURE
We are interested in implementing the RAM-based TCAM on FPGA. While the theoretical discussion in Section 2 excludes the priority encoder, the hardware architecture of the RAM-based TCAM must include it.

3.1 Basic Architecture
The theoretical model of the RAM-based TCAM implementation (discussed in Section 2.3.1) can be directly mapped to the hardware architecture shown in Figure 3. An N × W TCAM is implemented using P RAMs, where the size of the i-th RAM is 2^{W_i} × N and sum_{i=1}^{P} W_i = W. As illustrated in Algorithm 1, a lookup is performed by dividing the input W-bit key into P segments, the i-th of which is W_i bits long. Each segment of the key is used as the address to access the corresponding RAM, and each RAM outputs an N-bit vector. The P N-bit vectors are then bitwise ANDed to generate the final match vector, which is fed into the priority encoder to obtain the index of the matching word with the highest priority.

Figure 3: Basic architecture (without update logic).

A 1-bit Match signal is also generated to indicate whether there is any match. We add update logic to the RAM-based TCAM so that it can complete any update by itself at run time. In accordance with Algorithm 2, Figure 4 shows the logic for updating the RAM-based TCAM, where W_max = max_{i=1..P} W_i. We use two W-bit binary numbers, denoted Data and Mask, to represent the W-bit ternary word t to be updated. The i-th bit of t is a don't care bit if and only if the i-th bit of Mask is set to 1, i = 1, 2, ..., W. For example, the 2-bit ternary word 0* can be represented by Data = 00 or 01, and Mask = 01. Id specifies the index of the ternary word to be updated. Op indicates whether the ternary word is to be added (Op = 1) or deleted (Op = 0).

Figure 4: Update logic. CMP: compare. CHG: change.

Adding or deleting the Id-th ternary word t is accomplished by rewriting the Id-th bit of every RAM word: for an add, the bit is set at the addresses that match t and cleared at all others; for a delete, it is cleared everywhere. Meanwhile, we must keep the rest of the bits of these RAM words unchanged. Hence we need to read the original content of each RAM word, change only the Id-th bit, and then write the updated RAM word back to the RAM. This requires 2 × 2^w clock cycles to update a single-port RAM whose address width is w. To reduce the update latency, we utilize a simple dual-port RAM and perform the read and the write in the same clock cycle. A simple dual-port RAM has two address ports, one used only for reads and the other only for writes. At each clock cycle during an update, the update logic writes the updated RAM word to address k while reading the content of the RAM word at address k + 1. Hence the update latency becomes 2^w + 1 clock cycles, where the extra clock cycle is consumed to fetch the content of the first RAM word. Another part of the update logic is a state machine (not shown in Figure 4) that switches the state of the TCAM between lookup and update. During an update, no lookup is permitted and any match result is invalid.

3.2 Modular Architecture
In implementing a large-scale RAM-based TCAM on FPGA, there are two main challenges:

Throughput: When the TCAM is deeper or wider, the logic and routing complexities become larger, especially for bitwise-ANDing many wide bit vectors and for priority encoding a deep match vector. This results in significant degradation of the achievable clock rate, which determines the maximum throughput of the RAM-based TCAM.

Resource usage: The on-chip resources of an FPGA device are limited. Hence we must optimize the architecture to save resources or use them efficiently. We need to find the best memory configuration based on the physical capabilities, and it is also desirable to enable resource sharing between subsystems.

We propose a scalable and modular architecture that employs configurable small-size RAM-based TCAM units as building blocks. Both bit-vector ANDing and priority encoding are performed in a localized and pipelined fashion so that high throughput is sustained for large TCAMs.
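The dual-port update schedule described above (write address k while prefetching address k + 1) can be sketched in software as follows. The function and its toy RAM are our own illustrative model of the read-ahead/write-back pipeline, not the RTL:

```python
def stream_update(ram, bit_idx, matching_addrs):
    # One pass over a 2^w-deep simple dual-port RAM: in each cycle the engine
    # writes back the word fetched in the previous cycle (with only bit bit_idx
    # changed) while the read port prefetches the next address. The first cycle
    # only fetches word 0, so the whole pass takes 2^w + 1 cycles.
    depth = len(ram)
    fetched = ram[0]                  # cycle 1: read address 0, no write yet
    cycles = 1
    for k in range(depth):
        nxt = ram[(k + 1) % depth]    # read port: prefetch address k + 1
        if k in matching_addrs:       # write port: update address k ...
            ram[k] = fetched | (1 << bit_idx)
        else:
            ram[k] = fetched & ~(1 << bit_idx)
        fetched = nxt                 # ... keeping all other bits unchanged
        cycles += 1
    return cycles                     # 2^w + 1

ram = [0b11, 0b01, 0b10, 0b00]        # a toy 4-word RAM (w = 2)
print(stream_update(ram, bit_idx=1, matching_addrs={0, 2}), ram)  # 5 cycles
```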
We decouple the update logic from each unit so that a single update engine can be shared flexibly by multiple TCAM units, saving on-chip logic resources. Note that such resource sharing is only possible in a modular architecture.

3.2.1 Overview
The top-level design consists of a grid of units, organized in multiple rows. Figure 5 shows the top-level architecture with R rows, each of which contains L units. The TCAM words with higher priority are stored in the units with lower index. The units within a row are searched sequentially in a pipelined fashion. Priority is resolved locally within each unit. After each row outputs a matching result, a global priority encoder selects the one with the globally highest priority. A minimal software model of this organization is sketched after this paragraph.

3.2.2 Unit Design
A TCAM unit is basically a U × W TCAM implemented in RAMs, where U is the number of TCAM words per unit. Figure 6 depicts the architecture of a TCAM unit. Each unit performs the local TCAM lookup and combines the local match result with the result from the preceding unit.
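The sketch below models the grid organization just described: the lowest global word index wins, each unit resolves priority locally, and a row forwards the first (highest-priority) result it produces. The function names and the flattening of RAM lookups into direct comparisons are our own simplifications:

```python
def unit_lookup(unit_words, key):
    # Local lookup and priority encoding inside one unit: the lowest local
    # index among the matching words wins.
    for i, t in enumerate(unit_words):
        if all(b in ('*', k) for k, b in zip(key, t)):
            return i
    return None

def row_lookup(row_units, key, row_base, U):
    # Models the pipeline along a row: a unit forwards the upstream result
    # untouched if one exists, since an earlier unit always has higher priority.
    for u, unit in enumerate(row_units):
        local = unit_lookup(unit, key)
        if local is not None:
            return row_base + u * U + local
    return None

def grid_lookup(rows, key, L, U):
    # Global priority encoder: among the R row results, the smallest ID wins.
    results = [row_lookup(row, key, r * L * U, U) for r, row in enumerate(rows)]
    results = [x for x in results if x is not None]
    return min(results) if results else None

# Two rows of two 2-word units each (R = L = U = 2); the words are our example.
rows = [[['00', '01'], ['0*', '10']],
        [['1*', '11'], ['**', '01']]]
print(grid_lookup(rows, '01', L=2, U=2))  # -> 1 (word '01' in unit 0 of row 0)
```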

Figure 5: Top-level architecture.

As the unit index determines the priority, a matching TCAM word stored in a preceding unit always has a higher priority than the local matching one. The U × W TCAM is constructed using P RAMs based on the basic architecture shown in Section 3.1. We use the same address width w for all the P RAMs to achieve the maximum memory efficiency, as discussed in Section 2.3.4.

Figure 6: A unit.

When W is large, there are many RAMs, each of which outputs a U-bit vector, and the throughput may degrade when bitwise-ANDing a large number of bit vectors. We therefore divide a unit into multiple pipeline stages. Let H denote the number of stages in a unit. Then each stage contains P/H RAMs. Within each stage, the bit vectors generated by the P/H RAMs are bitwise ANDed; the resulting U-bit vector is combined with the bit vector passed from the previous stage and then passed to the next stage. The last stage of the unit performs the local priority encoding.

3.2.3 Update Engine
We make the following observations: updating a TCAM word involves updating only one unit, and the update logic is identical for units with the same memory organization. To save logic resources, it is therefore desirable to share the update logic between units. We decouple the update logic from the units and build multiple update engines. Each update engine contains the update logic and serves multiple units. An update engine maintains a single state machine and decides which unit to update based on the index (Id) of the TCAM word being updated. A unit receives from its update engine the addresses and the write-enable signals for its RAMs. The unit also interacts with its update engine to exchange the bit vectors needed to update each RAM word. Due to the decoupling of the update logic from the units, the association between the lookup units (LUs) and the update engines (UEs) is flexible. The only constraint is that the units served by the same update engine must have the same memory organization (i.e., the same P and w). Figure 7 shows three different example layouts of the update engines in a 4-row, 4-unit-per-row architecture.

3.3 Explicit Range Match
In some network search applications such as access control lists (ACL), a packet is matched against a set of rules. An ACL-like rule specifies a match condition on each of multiple packet header fields. Some fields, such as TCP ports, are specified using ranges rather than ternary strings. Taking 5-field ACL as an example, the two 16-bit port fields are normally specified as ranges. The ranges must be converted into ternary strings so that such rules can be stored in a TCAM. However, a range may be converted into multiple ternary strings: an r-bit range can be expanded to 2(r - 1) prefixes or 2(r - 2) ternary strings. If there are D such fields in a rule, the rule can be expanded to (2r - 4)^D ternary words in the worst case. This problem is called rule expansion [14]. Various range encoding methods have been proposed to minimize rule expansion.
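To make the expansion cost concrete, the following sketch performs the standard range-to-prefix splitting (our own illustration, not the paper's method) and reproduces the 2(r - 1) worst case:

```python
def range_to_prefixes(lo, hi, r):
    # Standard range-to-prefix splitting: repeatedly carve off the largest
    # power-of-two block that is aligned at lo and still fits inside [lo, hi].
    out = []
    while lo <= hi:
        size = (lo & -lo) if lo else (1 << r)  # alignment constraint at lo
        while size > hi - lo + 1:              # must also fit inside [lo, hi]
            size >>= 1
        bits = size.bit_length() - 1           # number of trailing '*'s
        stem = format(lo >> bits, '0%db' % (r - bits)) if bits < r else ''
        out.append(stem + '*' * bits)
        lo += size
    return out

# Worst case for r = 4: the range [1, 14] needs 2(r - 1) = 6 prefixes.
print(range_to_prefixes(1, 14, 4))
# ['0001', '001*', '01**', '10**', '110*', '1110']
```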

Figure 7: Example layouts of update engines (UEs): (a) Row; (b) Column; (c) Square. LU: lookup unit.

Even with the optimal range encoding [14], it still needs r ternary words to represent an r-bit range. In such a case, a rule with D range fields will occupy O(r^D) TCAM words. An attractive advantage of FPGA compared with ASIC is that we can reprogram the hardware on the fly to add customized logic. So for the ACL-like search problems, we adopt an idea similar to [15] and augment the TCAM design with explicit range match support instead of converting ranges into ternary strings. This is achieved by storing the lower and upper bounds of each range explicitly in registers. Hence, if there are N rules each containing D r-bit port fields, we require N × D × r × 2 bits of registers in total to store the lower and the upper bounds of all the ranges. On the other hand, the size of the TCAM that needs to be stored in RAMs is reduced to N × (W - D × r).

3.4 Mapping to Physical Hardware
According to the theoretical analysis in Section 2.3.4, the RAM-based implementation of an N × W TCAM requires the minimum memory when employing shallow RAMs with the same depth of 2 or 4. However, real hardware places limits on the minimum depth of physical RAMs. For example, each block RAM (BRAM) available on a Xilinx Virtex-7 FPGA can be configured as 512 × 72, 1K × 36, 2K × 18, 4K × 9, 8K × 4, 16K × 2, or 32K × 1 in simple dual-port mode. In other words, the minimum depth of a BRAM is 512 = 2^9. Let d_min denote the minimum address width of the physical RAM. An N × W logical RAM where N <= 2^{d_min} will be mapped to a 2^{d_min} × W physical RAM. Thus the RAM/TCAM ratio becomes 2^{max(w, d_min)} / w instead of 2^w / w. A trick that can be played is to map multiple shallow logical RAMs to one deep physical RAM: for example, two 2^d × W logical RAMs can be mapped to a single 2^{d+1} × W physical RAM. But the throughput will be halved unless the physical RAM has two sets of input/output ports used independently by the two logical RAMs. While some multi-port RAM designs [16] are available, they bring extra complications and are beyond the scope of this paper. Hence, when implementing the RAM-based TCAM in real hardware, the address width w of each RAM should be chosen carefully based on the available physical configurations.

4. PERFORMANCE EVALUATION
We implement our modular RAM-based TCAM architecture on a Xilinx Virtex-7 XC7V2000T device with the -2 speed grade. We evaluate the performance based on the post place and route results from the Xilinx Vivado 2013.1 development toolset. To recap, we list the key parameters of the architecture in Table 4. Note that N = R × L × U.

Table 4: Architectural parameters
Parameter | Description
N | TCAM depth
W | TCAM width
R | The number of rows
L | The number of units per row
U | The number of TCAM words per unit
H | The number of stages per unit
w | The address width of the RAMs

4.1 Analysis and Estimation
Due to its pipelined architecture, our RAM-based TCAM implementation processes one packet every clock cycle. Thus the throughput is F million packets per second (Mpps) when the clock rate of the implementation achieves F MHz. During lookup, each packet traverses the R rows in parallel. It takes L × H clock cycles to go through a row, and one more clock cycle is needed for final priority encoding when the architecture consists of more than one row. Thus the lookup latency in terms of the number of clock cycles is

  Latency = L × H, if R = 1;  L × H + 1, if R > 1.

The address width of the RAMs, w, is a critical parameter in our RAM-based TCAM: the update latency is 2^w + 1 clock cycles, while the memory requirement for implementing an N × W TCAM is (2^w / w) × N × W bits.
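These formulas can be wrapped into a small calculator. The sketch below is our own convenience model, and the R/L split in the example is one illustrative way to reach the paper's largest configuration (the clock rate itself can only come from place-and-route, not from a formula):

```python
def tcam_cost(N, W, R, L, U, H, w, f_mhz):
    # Section 4.1's formulas for a configuration of the modular architecture.
    assert N == R * L * U
    throughput_mpps = f_mhz                       # one lookup per clock cycle
    latency = L * H if R == 1 else L * H + 1      # +1 for global priority encoding
    update_cycles = 2 ** w + 1
    memory_bits = (2 ** w) * N * W // w           # RAM/TCAM ratio is 2^w / w
    return throughput_mpps, latency, update_cycles, memory_bits

# 16K x 150 bits at 150 MHz with w = 5 (R/L split is our illustrative choice):
print(tcam_cost(N=16384, W=150, R=64, L=4, U=64, H=1, w=5, f_mhz=150))
# -> (150, 5, 33, 15728640)
```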
To determine the optimal w, we examine the physical memory resources available on the FPGA device. There are two types of memory resources in Xilinx Virtex FPGAs: distributed RAM and block RAM (BRAM). While BRAMs are provided as standalone RAMs, distributed RAM is coupled with logic resources. The basic logic resource unit of an FPGA is usually called a Slice; only a certain type of Slice, named SliceM, can be used to build distributed RAM. As required by our architecture, we consider the RAMs only in simple dual-port (SDP) mode. Table 5 summarizes the total amount and the minimum address width (d_min) of the memory resources available on our target FPGA device.

Table 5: Memory resources on the XC7V2000T
RAM type (in SDP mode) | Total size (bits) | d_min
Distributed RAM        | 16,550,400        | 5
BRAM                   | 47,628,288        | 9
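The capacity estimates carried out in the next paragraphs follow directly from Table 5. As a quick sketch (the totals are the SDP-mode figures as reconstructed in Table 5):

```python
def max_tcam_bits(total_ram_bits, d_min):
    # With w = d_min, each TCAM bit costs 2^d_min / d_min RAM bits, so the
    # largest supportable TCAM is the available RAM divided by that ratio.
    return total_ram_bits * d_min // (2 ** d_min)

print(max_tcam_bits(16_550_400, 5))   # distributed RAM -> 2,586,000 bits
print(max_tcam_bits(47_628_288, 9))   # BRAM            -> 837,216 bits
```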

Figure 8: Increasing the TCAM depth (N): throughput, memory, power, and Slice usage (with utilization) for L = 4, 8, 16, 32.

Either distributed RAM or BRAM can be employed to implement the RAM-based TCAM architecture. In either case, we set w = d_min of the employed RAM type to achieve the highest memory efficiency. Based on the information in Table 5 we can estimate the maximum size of the TCAM that can be implemented on the target device. When the architecture is implemented using distributed RAM, the RAM/TCAM ratio is 2^5/5 = 32/5 and the maximum TCAM size is 16,550,400 × 5/32 = 2,586,000 bits. When using BRAM, the RAM/TCAM ratio is 2^9/9 = 512/9 and the maximum TCAM size is 47,628,288 × 9/512 = 837,216 bits. We can see that, though the total amount of BRAM bits is nearly triple that of distributed RAM bits, a BRAM-based implementation supports a much smaller TCAM due to the higher RAM/TCAM ratio. Moreover, the update latency of a distributed RAM-based implementation is 33 clock cycles, while the update latency of a BRAM-based implementation is 513 clock cycles. Hence in most of our experiments, distributed RAMs (w = 5) are employed. Also note that our architecture is modular: each unit may independently select the RAM type for its TCAM implementation. Thus the maximum TCAM size would be about 3.4 Mbits when both distributed RAMs and BRAMs are utilized.

4.2 Scalability
We are interested in how the performance scales when the TCAM depth (N) or the TCAM width (W) is increased. The key performance metrics include the throughput, the memory requirement, the power consumption estimates, and the resource usage. In these experiments, the default parameter settings are L = 4, U = 64, H = 1, and w = 5, and each unit contains its own update logic. First, we fix W = 150 and increase N by doubling R. Figure 8 shows the results, where the memory and the Slice results are drawn on a logarithmic scale; the throughput is the achievable post place-and-route clock rate in Mpps. As expected, the throughput degrades for a deeper TCAM. This is because a larger R results in a deeper final priority encoder, which becomes the critical path. Also, with a larger TCAM the resource utilization approaches 100%, which makes it difficult to route signals and further lowers the achievable clock rate. Fortunately, because of the configurable architecture, we can trade latency for throughput: since N = R × L × U, we can increase L to reduce R for a given N while keeping the other parameters fixed. As shown in Figure 8, a larger L results in a higher throughput, though at the expense of a larger latency. By tuning the latency-throughput trade-off, our design can sustain a 150 MHz clock rate for large TCAMs of up to 16K × 150 bits = 2.4 Mbits. Such a clock rate allows the design to process 150 million packets per second (Mpps), which translates to 100 Gbps throughput for minimum-size Ethernet packets.
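A quick sketch of this trade-off and of the throughput conversion (the R/L factorizations are our own illustrative choices):

```python
def latency_cycles(R, L, H=1):
    return L * H if R == 1 else L * H + 1

# For N = 16384 and U = 64 there are 256 units; a larger L means fewer rows R
# (a shallower global priority encoder) at the cost of a longer row pipeline:
for L in (4, 8, 16, 32):
    R = 256 // L
    print('L=%2d R=%2d latency=%2d cycles' % (L, R, latency_cycles(R, L)))

# 150 Mpps at minimum-size Ethernet frames: 64 B frame + 20 B preamble and
# inter-frame gap = 84 B on the wire (our assumption about the conversion):
print(150e6 * 84 * 8 / 1e9, 'Gbps')   # -> 100.8 Gbps
```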
Second, we fix the TCAM depth N = 4096 and increase the TCAM width W. Figure 9 shows that a larger TCAM width results in a lower throughput. This is because there are W/w RAMs per unit, where w = 5 in this implementation; with a large W, it becomes time-critical to bitwise-AND a large number of bit vectors within each unit. Again this can be remedied by trading latency for throughput: we increase the number of stages per unit so that each stage handles a smaller number of RAMs. As shown in Figure 9, the throughput is improved by increasing H by 1, which on the other hand increases the latency by L = 4 clock cycles. In both of the above experiments, the resource usage is linear in the TCAM size (N × W). The estimated power consumption is sublinear in the TCAM depth and linear in the TCAM width.

Figure 9: Increasing the TCAM width (W): throughput, memory, power, and Slice usage (with utilization) for H = 1 and H = 2.

4.3 Impact of Unit Size
Each TCAM unit in our architecture stores U TCAM words. It is desirable to have a small U so that the local bit-vector ANDing and priority encoding within each unit do not become the critical path. On the other hand, a smaller U leads to a larger L when R is fixed for a given N. Thus we can also tune the latency-throughput trade-off by changing U. In this experiment, we fix R = 4 and H = 1 and vary U in implementing a 1024 × 150 TCAM. As expected, Figure 10 shows that a larger U results in a lower throughput as well as a lower latency. Such a trade-off can be exploited for latency-sensitive applications where the latency is measured in nanoseconds instead of the number of clock cycles. Based on the results shown in Figure 10, when U is doubled from 64 to 128, the throughput is slightly degraded while the latency is reduced from roughly 30 ns to roughly 20 ns. The change of U has little impact on the other performance metrics, which are thus not shown here.

Figure 10: Increasing the unit size (U): throughput (Mpps) and latency (number of clock cycles).

4.4 Distributed vs. Block RAMs
As discussed in Section 4.1, distributed RAMs are more efficient than BRAMs for implementing the RAM-based TCAM on the target FPGA. But it is usually desirable to integrate the RAM-based TCAM with other engines (such as a packet parser) on a single FPGA device to form a complete packet processing system. The choice of RAM type may then depend not only on efficiency but also on the resource budget: BRAMs will be preferred for the RAM-based TCAM in case the other engines require many Slices but few BRAMs. Hence we conduct experiments to characterize the performance of the RAM-based TCAMs implemented using the two different RAM types. In these experiments, W = 150, L = 4, U = 64, and H = 1, and each TCAM unit contains its own update logic. As shown in Table 6, distributed RAM-based implementations achieve higher clock rates and lower power consumption than BRAM-based implementations. This is due to the fact that a BRAM is deeper and larger, and thus requires a longer access time and dissipates more power than a distributed RAM. Because distributed RAMs are based on Slices (SliceM), the distributed RAM-based implementations require much more logic resource (in terms of Slices) than BRAM-based implementations.

Table 6: Implementation results based on different RAM types
TCAM size (N × W)        | 1024 × 150 bits   | 2048 × 150 bits   | 4096 × 150 bits
RAM type                 | Distrib. | Block  | Distrib. | Block  | Distrib. | Block
Throughput (Mpps)        |          |        |          |        |          |
Slice utilization        | 6.72%    | 3.97%  | 13.8%    | 7.7%   | 26.4%    | 14.94%
BRAM utilization         | 0.0%     | 21.05% | 0.0%     | 42.1%  | 0.0%     | 84.2%
Estimated power (Watts)  |          |        |          |        |          |

4.5 Impact of Update Engine Layout
As discussed in Section 3.2.3, we can have flexible associations between lookup units and update engines by decoupling the update logic from each unit. We conduct experiments to evaluate the impact of different update engine (UE) layouts on the performance of the architecture. The evaluated update engine layouts include:

All: Each unit contains its own update logic.
Square: The four neighboring units forming a square share the same update engine (Figure 7(c)).

Row: The units in the same row share the same update engine (Figure 7(a)).
Column: The units in the same column share the same update engine (Figure 7(b)).
None: No update logic for any unit; the TCAM is not updatable.

In these experiments, N = 1024, W = 150, R = 4, L = 4, U = 64, H = 1, and w = 5. So the architecture consists of 4 by 4 units, essentially as illustrated in Figure 7. The implementation results are shown in Figure 11. Comparing the Slice results of the All and the None layouts, we can infer that the update logic accounts for more than half of the total logic usage of the architecture in the All layout. In the Square, Row, and Column layouts, sharing the update engine reduces the logic resource usage by roughly 25% compared with the All layout. These three layouts achieve similar logic resource savings because all of them have each update engine shared by four lookup units. The costs of sharing the update engine are a slightly degraded throughput and a slightly increased power consumption, basically due to the wide mux/demux and the stretched signal routing between lookup units and update engines. Higher throughput could be obtained by careful chip floor planning. Also note that the update engine layout has no effect on the memory requirement, which is determined only by the lookup units.

Figure 11: Impact of the update engine (UE) layout: throughput, memory, power, and Slice usage for the All, Square, Row, Column, and None layouts.

4.6 Cost of Explicit Range Matching
As discussed in Section 3.3, we provide the capability to add explicit range matching logic to the TCAM architecture so that range-to-ternary conversion can be avoided for some search applications such as ACL. Such explicit range matching logic is based on a heavy use of registers. We conduct experiments to understand the performance cost of the explicit range matching logic. We fix W = 150 and increase the number of 16-bit fields that are specified as ranges. The other parameters are the defaults: N = 1024, R = 4, L = 4, U = 64, H = 1, and w = 5 (distributed RAM), and each TCAM unit has its own update logic. Table 7 shows that adding the explicit range matching logic for each 16-bit range-based field requires about 5K more Slices and over 30K more registers. The increased usage of logic also results in higher power consumption. Whether to enable the explicit range matching should be decided based on the characteristics of the ruleset used in the search application. Consider a ruleset whose expansion ratio (due to range-to-ternary conversion) is a, while it requires b times more logic resource to add the explicit range matching logic. Then it is better not to enable the explicit range matching if a < b.

Table 7: Adding explicit range matching
# of range fields        | 0     | 1     | 2
Throughput (Mpps)        |       |       |
Slice utilization        | 6.72% | 8.56% | 10.26%
Register utilization     | 1.54% | 2.82% | 4.2%
Estimated power (Watts)  |       |       |
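For reference, the match semantics whose cost Table 7 quantifies can be sketched as follows. This is a software model with our own names; in hardware the ternary part is the RAM lookup and the bounds sit in registers:

```python
def rule_matches(key_bits, ternary, port_values, port_bounds):
    # Explicit range matching: the ternary part of the rule is matched as
    # before (modeled here by direct comparison), while each range field is
    # checked against its register-stored (lo, hi) bounds, so no range-to-
    # ternary conversion, and hence no rule expansion, is needed.
    ternary_ok = all(t in ('*', k) for k, t in zip(key_bits, ternary))
    ranges_ok = all(lo <= v <= hi
                    for v, (lo, hi) in zip(port_values, port_bounds))
    return ternary_ok and ranges_ok

# One ACL-like rule: a ternary part plus two 16-bit port ranges (our example).
print(rule_matches('1010', '10**',
                   port_values=(80, 443),
                   port_bounds=[(0, 1023), (443, 443)]))  # -> True
```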

5. RELATED WORK
Although various algorithmic solutions (including those using FPGAs) [4, 10] have been proposed as alternatives to TCAMs, their success so far is limited to a few particular applications such as exact matching and longest prefix matching. While they can efficiently exploit the characteristics of real-life data sets, these algorithmic solutions cannot provide the same deterministic performance (e.g., throughput, latency, storage requirement, etc.) as TCAMs when searching over an arbitrary set of ternary words.

Most existing FPGA-based TCAM designs are based on brute-force implementations which map the native TCAM architecture directly onto FPGA logic. A straightforward method is to use two bits of registers to encode one TCAM bit. But such a design cannot scale well due to the limited amount of registers, which are usually heavily used for various other purposes such as pipelining. For example, the target FPGA device (XC7V2000T) in our experiments contains 2.4 Mbits of registers, which could implement a TCAM of no larger than 1.2 Mbits; in reality, the TCAM that can be implemented using registers would be much smaller as a result of routing and timing challenges. Locke [17] proposes a more efficient design based on the 16-bit shift register (SRL16), where an SRL16 is used to build a 1 × 2 TCAM. Like the distributed RAM, an SRL16 is based on SliceM: a SliceM can be converted into either 8 SRL16s or a 32 × 6 distributed RAM in simple dual-port mode. The largest TCAM that could be implemented using SRL16s on our target device (XC7V2000T) would be 1.6 Mbits.

Recently, Ullah et al. [13] and Zerbini et al. [12] presented FPGA implementations of their RAM-based TCAMs. These designs share the same basic idea as our design of using the search key as the address to access RAMs. However, neither of them gives a theoretical analysis or a correctness proof for the construction of a TCAM using RAMs. Their architectures are monolithic and could be viewed as a single large one-stage TCAM unit in our modular architecture. When implementing a large TCAM, such monolithic architectures suffer from bitwise-ANDing many wide bit vectors and from priority encoding the deep match vector. Due to the lack of a thorough investigation of the optimal settings, their FPGA implementations are less efficient than our design. [13] implements a TCAM using more than 6 Mbits of BRAM on a Xilinx Virtex-5 FPGA; when the priority encoder is added, the clock rate of their implementation is merely 22 MHz. The TCAM designs of [12] are implemented on high-end Altera FPGAs with the fastest speed grade; even with these large-capacity FPGAs, their implementations can support a TCAM of no larger than 0.5 Mbits.

6. CONCLUSION
TCAMs are widely used in network infrastructure for various search functions. There has been growing interest in implementing TCAMs using reconfigurable hardware such as FPGA: such soft TCAMs are more flexible and easier to integrate than ASIC-based hard TCAMs. But existing FPGA-based TCAM designs can support only small TCAMs, mainly due to inefficient resource usage. This paper shares our efforts and experience in pushing the limit of implementing large TCAMs on a state-of-the-art FPGA. We formalize the ideas and the algorithms behind the RAM-based TCAM and analyze the performance thoroughly.
After identifying the key challenges, we propose a scalable and modular architecture with multiple optimizations. We evaluate our design comprehensively to understand the various performance trade-offs. The FPGA implementation results show that our design can support a large TCAM of 2.4 Mbits while sustaining a high throughput of 150 Mpps.

7. REFERENCES
[1] OpenFlow - enabling innovation in your network.
[2] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz. Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN. In SIGCOMM '13: Proceedings of the ACM SIGCOMM 2013 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 99-110, August 2013.
[3] D. E. Taylor. Survey and taxonomy of packet classification techniques. ACM Comput. Surv., 37(3):238-275, Sept. 2005.
[4] F. Baboescu, S. Singh, and G. Varghese. Packet classification for core routers: Is there an alternative to CAMs? In INFOCOM '03: Proceedings of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies, volume 1, pages 53-63, March/April 2003.
[5] Xilinx Virtex-7 FPGA Family.
[6] Altera Stratix V FPGAs.
[7] M. Becchi and P. Crowley. Efficient regular expression evaluation: theory to practice. In ANCS '08: Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, pages 50-59, 2008.
[8] M. Attig and G. J. Brebner. 400 Gb/s programmable packet parsing on a single FPGA. In ANCS '11: Proceedings of the 7th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, pages 12-23, 2011.
[9] G. Gogniat, T. Wolf, W. Burleson, J.-P. Diguet, L. Bossuet, and R. Vaslin. Reconfigurable hardware for high-security/high-performance embedded systems: the SAFES perspective. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(2):144-155, 2008.

[10] W. Jiang and V. K. Prasanna. Scalable packet classification on FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 20(9):1668-1680, 2012.
[11] W. Jiang and V. K. Prasanna. Field-split parallel architecture for high performance multi-match packet classification using FPGAs. In SPAA '09: Proceedings of the 21st Annual Symposium on Parallelism in Algorithms and Architectures, pages 188-196, 2009.
[12] C. A. Zerbini and J. M. Finochietto. Performance evaluation of packet classification on FPGA-based TCAM emulation architectures. In GLOBECOM '12: Proceedings of the IEEE Global Communications Conference, 2012.
[13] Z. Ullah, M. K. Jaiswal, Y. C. Chan, and R. C. C. Cheung. FPGA implementation of SRAM-based ternary content addressable memory. In IPDPSW '12: Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012.
[14] O. Rottenstreich, R. Cohen, D. Raz, and I. Keslassy. Exact worst-case TCAM rule expansion. IEEE Transactions on Computers, 62(6):1127-1140, 2013.
[15] E. Spitznagel, D. Taylor, and J. Turner. Packet classification using extended TCAMs. In ICNP '03: Proceedings of the 11th IEEE International Conference on Network Protocols, pages 120-131, 2003.
[16] C. E. LaForest and J. G. Steffan. Efficient multi-ported memories for FPGAs. In FPGA '10: Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 41-50, 2010.
[17] K. Locke. XAPP1151 - Parameterizable Content-Addressable Memory. application_notes/xapp1151_Param_CAM.pdf.


More information

CHAPTER 4 BLOOM FILTER

CHAPTER 4 BLOOM FILTER 54 CHAPTER 4 BLOOM FILTER 4.1 INTRODUCTION Bloom filter was formulated by Bloom (1970) and is used widely today for different purposes including web caching, intrusion detection, content based routing,

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

IP packet forwarding, or simply, IP-lookup, is a classic

IP packet forwarding, or simply, IP-lookup, is a classic Scalable Tree-based Architectures for IPv4/v6 Lookup Using Prefix Partitioning Hoang Le, Student Member, IEEE, and Viktor K. Prasanna, Fellow, IEEE Abstract Memory efficiency and dynamically updateable

More information

High-Performance Packet Classification on GPU

High-Performance Packet Classification on GPU High-Performance Packet Classification on GPU Shijie Zhou, Shreyas G. Singapura and Viktor. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, CA 99

More information

Jakub Cabal et al. CESNET

Jakub Cabal et al. CESNET CONFIGURABLE FPGA PACKET PARSER FOR TERABIT NETWORKS WITH GUARANTEED WIRE- SPEED THROUGHPUT Jakub Cabal et al. CESNET 2018/02/27 FPGA, Monterey, USA Packet parsing INTRODUCTION It is among basic operations

More information

Multi-dimensional Packet Classification on FPGA: 100 Gbps and Beyond

Multi-dimensional Packet Classification on FPGA: 100 Gbps and Beyond Multi-dimensional Packet Classification on FPGA: 00 Gbps and Beyond Yaxuan Qi, Jeffrey Fong 2, Weirong Jiang 3, Bo Xu 4, Jun Li 5, Viktor Prasanna 6, 2, 4, 5 Research Institute of Information Technology

More information

Lecture 11: Packet forwarding

Lecture 11: Packet forwarding Lecture 11: Packet forwarding Anirudh Sivaraman 2017/10/23 This week we ll talk about the data plane. Recall that the routing layer broadly consists of two parts: (1) the control plane that computes routes

More information

Field-Split Parallel Architecture for High Performance Multi-Match Packet Classification Using FPGAs

Field-Split Parallel Architecture for High Performance Multi-Match Packet Classification Using FPGAs Field-Split Parallel Architecture for High Performance Multi-Match Packet Classification Using FPGAs Weirong Jiang Ming Hsieh Department of Electrical Engineering University of Southern California Los

More information

A Fast Ternary CAM Design for IP Networking Applications

A Fast Ternary CAM Design for IP Networking Applications A Fast Ternary CAM Design for IP Networking Applications Bruce Gamache (gamacheb@coloradoedu) Zachary Pfeffer (pfefferz@coloradoedu) Sunil P Khatri (spkhatri@coloradoedu) Department of Electrical and Computer

More information

Problem Statement. Algorithm MinDPQ (contd.) Algorithm MinDPQ. Summary of Algorithm MinDPQ. Algorithm MinDPQ: Experimental Results.

Problem Statement. Algorithm MinDPQ (contd.) Algorithm MinDPQ. Summary of Algorithm MinDPQ. Algorithm MinDPQ: Experimental Results. Algorithms for Routing Lookups and Packet Classification October 3, 2000 High Level Outline Part I. Routing Lookups - Two lookup algorithms Part II. Packet Classification - One classification algorithm

More information

Scalable Multi-Pipeline Architecture for High Performance Multi-Pattern String Matching

Scalable Multi-Pipeline Architecture for High Performance Multi-Pattern String Matching Scalable Multi-Pipeline Architecture for High Performance Multi-Pattern String Matching Weirong Jiang, Yi-Hua E. Yang and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of

More information

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices 3 Digital Systems Implementation Programmable Logic Devices Basic FPGA Architectures Why Programmable Logic Devices (PLDs)? Low cost, low risk way of implementing digital circuits as application specific

More information

High-Performance Network Data-Packet Classification Using Embedded Content-Addressable Memory

High-Performance Network Data-Packet Classification Using Embedded Content-Addressable Memory High-Performance Network Data-Packet Classification Using Embedded Content-Addressable Memory Embedding a TCAM block along with the rest of the system in a single device should overcome the disadvantages

More information

Copyright 2011 Society of Photo-Optical Instrumentation Engineers. This paper was published in Proceedings of SPIE (Proc. SPIE Vol.

Copyright 2011 Society of Photo-Optical Instrumentation Engineers. This paper was published in Proceedings of SPIE (Proc. SPIE Vol. Copyright 2011 Society of Photo-Optical Instrumentation Engineers. This paper was published in Proceedings of SPIE (Proc. SPIE Vol. 8008, 80080E, DOI: http://dx.doi.org/10.1117/12.905281 ) and is made

More information

Scalable Packet Classification on FPGA

Scalable Packet Classification on FPGA Scalable Packet Classification on FPGA 1 Deepak K. Thakkar, 2 Dr. B. S. Agarkar 1 Student, 2 Professor 1 Electronics and Telecommunication Engineering, 1 Sanjivani college of Engineering, Kopargaon, India.

More information

FPGA Matrix Multiplier

FPGA Matrix Multiplier FPGA Matrix Multiplier In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri

More information

SCALABLE HIGH-THROUGHPUT SRAM-BASED ARCHITECTURE FOR IP-LOOKUP USING FPGA. Hoang Le, Weirong Jiang, Viktor K. Prasanna

SCALABLE HIGH-THROUGHPUT SRAM-BASED ARCHITECTURE FOR IP-LOOKUP USING FPGA. Hoang Le, Weirong Jiang, Viktor K. Prasanna SCALABLE HIGH-THROUGHPUT SRAM-BASED ARCHITECTURE FOR IP-LOOKUP USING FPGA Hoang Le, Weirong Jiang, Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Los

More information

Index Terms- Field Programmable Gate Array, Content Addressable memory, Intrusion Detection system.

Index Terms- Field Programmable Gate Array, Content Addressable memory, Intrusion Detection system. Dynamic Based Reconfigurable Content Addressable Memory for FastString Matching N.Manonmani 1, K.Suman 2, C.Udhayakumar 3 Dept of ECE, Sri Eshwar College of Engineering, Kinathukadavu, Coimbatore, India1

More information

Bringing Programmability to the Data Plane: Packet Processing with a NoC-Enhanced FPGA

Bringing Programmability to the Data Plane: Packet Processing with a NoC-Enhanced FPGA Total Tranceiver BW (Gb/s) Bringing Programmability to the Data Plane: Packet Processing with a NoC-Enhanced FPGA Andrew Bitar, Mohamed S. Abdelfattah, Vaughn Betz Department of Electrical and Computer

More information

Homework 1 Solutions:

Homework 1 Solutions: Homework 1 Solutions: If we expand the square in the statistic, we get three terms that have to be summed for each i: (ExpectedFrequency[i]), (2ObservedFrequency[i]) and (ObservedFrequency[i])2 / Expected

More information

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures Steven J.E. Wilton Department of Electrical and Computer Engineering University of British Columbia Vancouver, BC, Canada, V6T

More information

EITF35: Introduction to Structured VLSI Design

EITF35: Introduction to Structured VLSI Design EITF35: Introduction to Structured VLSI Design Introduction to FPGA design Rakesh Gangarajaiah Rakesh.gangarajaiah@eit.lth.se Slides from Chenxin Zhang and Steffan Malkowsky WWW.FPGA What is FPGA? Field

More information

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER International Journal of Computer Engineering and Applications, Volume VI, Issue II, May 14 www.ijcea.com ISSN 2321 3469 DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER Kiran K C 1, Sunil T D

More information

Automatic compilation framework for Bloom filter based intrusion detection

Automatic compilation framework for Bloom filter based intrusion detection Automatic compilation framework for Bloom filter based intrusion detection Dinesh C Suresh, Zhi Guo*, Betul Buyukkurt and Walid A. Najjar Department of Computer Science and Engineering *Department of Electrical

More information

Multicycle-Path Challenges in Multi-Synchronous Systems

Multicycle-Path Challenges in Multi-Synchronous Systems Multicycle-Path Challenges in Multi-Synchronous Systems G. Engel 1, J. Ziebold 1, J. Cox 2, T. Chaney 2, M. Burke 2, and Mike Gulotta 3 1 Department of Electrical and Computer Engineering, IC Design Research

More information

Data Structures for Packet Classification

Data Structures for Packet Classification Presenter: Patrick Nicholson Department of Computer Science CS840 Topics in Data Structures Outline 1 The Problem 2 Hardware Solutions 3 Data Structures: 1D 4 Trie-Based Solutions Packet Classification

More information

FPGA Based Packet Classification Using Multi-Pipeline Architecture

FPGA Based Packet Classification Using Multi-Pipeline Architecture International Journal of Wireless Communications and Mobile Computing 2015; 3(3): 27-32 Published online May 8, 2015 (http://www.sciencepublishinggroup.com/j/wcmc) doi: 10.11648/j.wcmc.20150303.11 ISSN:

More information

Pak. J. Biotechnol. Vol. 14 (Special Issue II) Pp (2017) Keerthiga D.S. and S. Bhavani

Pak. J. Biotechnol. Vol. 14 (Special Issue II) Pp (2017) Keerthiga D.S. and S. Bhavani DESIGN AND TESTABILITY OF Z-TERNARY CONTENT ADDRESSABLE MEMORY LOGIC Keerthiga Devi S. 1, Bhavani, S. 2 Department of ECE, FOE-CB, Karpagam Academy of Higher Education (Deemed to be University), Coimbatore,

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

Fault Grading FPGA Interconnect Test Configurations

Fault Grading FPGA Interconnect Test Configurations * Fault Grading FPGA Interconnect Test Configurations Mehdi Baradaran Tahoori Subhasish Mitra* Shahin Toutounchi Edward J. McCluskey Center for Reliable Computing Stanford University http://crc.stanford.edu

More information

P4 for an FPGA target

P4 for an FPGA target P4 for an FPGA target Gordon Brebner Xilinx Labs San José, USA P4 Workshop, Stanford University, 4 June 2015 What this talk is about FPGAs and packet processing languages Xilinx SDNet data plane builder

More information

Cisco Nexus 9508 Switch Power and Performance

Cisco Nexus 9508 Switch Power and Performance White Paper Cisco Nexus 9508 Switch Power and Performance The Cisco Nexus 9508 brings together data center switching power efficiency and forwarding performance in a high-density 40 Gigabit Ethernet form

More information

Online algorithms for clustering problems

Online algorithms for clustering problems University of Szeged Department of Computer Algorithms and Artificial Intelligence Online algorithms for clustering problems Summary of the Ph.D. thesis by Gabriella Divéki Supervisor Dr. Csanád Imreh

More information

ISSN Vol.05,Issue.09, September-2017, Pages:

ISSN Vol.05,Issue.09, September-2017, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,

More information

Advanced FPGA Design Methodologies with Xilinx Vivado

Advanced FPGA Design Methodologies with Xilinx Vivado Advanced FPGA Design Methodologies with Xilinx Vivado Alexander Jäger Computer Architecture Group Heidelberg University, Germany Abstract With shrinking feature sizes in the ASIC manufacturing technology,

More information

Content Addressable Memory (CAM) Implementation and Power Analysis on FPGA. Teng Hu. B.Eng., Southwest Jiaotong University, 2008

Content Addressable Memory (CAM) Implementation and Power Analysis on FPGA. Teng Hu. B.Eng., Southwest Jiaotong University, 2008 Content Addressable Memory (CAM) Implementation and Power Analysis on FPGA by Teng Hu B.Eng., Southwest Jiaotong University, 2008 A Report Submitted in Partial Fulfillment of the Requirements for the Degree

More information

CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger. Presented by: Ning Chen

CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger. Presented by: Ning Chen CSE 548 Computer Architecture Clock Rate vs IPC V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger Presented by: Ning Chen Transistor Changes Development of silicon fabrication technology caused transistor

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent

More information

A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms

A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms Jingzhao Ou and Viktor K. Prasanna Department of Electrical Engineering, University of Southern California Los Angeles, California,

More information

IP Forwarding. CSU CS557, Spring 2018 Instructor: Lorenzo De Carli

IP Forwarding. CSU CS557, Spring 2018 Instructor: Lorenzo De Carli IP Forwarding CSU CS557, Spring 2018 Instructor: Lorenzo De Carli 1 Sources George Varghese, Network Algorithmics, Morgan Kauffmann, December 2004 L. De Carli, Y. Pan, A. Kumar, C. Estan, K. Sankaralingam,

More information

EECS150 - Digital Design Lecture 16 - Memory

EECS150 - Digital Design Lecture 16 - Memory EECS150 - Digital Design Lecture 16 - Memory October 17, 2002 John Wawrzynek Fall 2002 EECS150 - Lec16-mem1 Page 1 Memory Basics Uses: data & program storage general purpose registers buffering table lookups

More information

CS419: Computer Networks. Lecture 6: March 7, 2005 Fast Address Lookup:

CS419: Computer Networks. Lecture 6: March 7, 2005 Fast Address Lookup: : Computer Networks Lecture 6: March 7, 2005 Fast Address Lookup: Forwarding/Routing Revisited Best-match Longest-prefix forwarding table lookup We looked at the semantics of bestmatch longest-prefix address

More information

LONGEST prefix matching (LPM) techniques have received

LONGEST prefix matching (LPM) techniques have received IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 14, NO. 2, APRIL 2006 397 Longest Prefix Matching Using Bloom Filters Sarang Dharmapurikar, Praveen Krishnamurthy, and David E. Taylor, Member, IEEE Abstract We

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

Efficient Multi-Match Packet Classification with TCAM

Efficient Multi-Match Packet Classification with TCAM Efficient Multi-Match Packet Classification with Fang Yu and Randy Katz fyu, randy @eecs.berkeley.edu CS Division, EECS Department, U.C.Berkeley Report No. UCB/CSD-4-1316 March 2004 Computer Science Division

More information

Area Efficient Multi-Ported Memories with Write Conflict Resolution

Area Efficient Multi-Ported Memories with Write Conflict Resolution Area Efficient Multi-Ported Memories with Write Conflict Resolution A thesis submitted to the Graduate School of University of Cincinnati in partial fulfillment of the requirements for the degree of Master

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

Dynamic Pipelining: Making IP- Lookup Truly Scalable

Dynamic Pipelining: Making IP- Lookup Truly Scalable Dynamic Pipelining: Making IP- Lookup Truly Scalable Jahangir Hasan T. N. Vijaykumar School of Electrical and Computer Engineering, Purdue University SIGCOMM 05 Rung-Bo-Su 10/26/05 1 0.Abstract IP-lookup

More information

AN EFFICIENT HYBRID ALGORITHM FOR MULTIDIMENSIONAL PACKET CLASSIFICATION

AN EFFICIENT HYBRID ALGORITHM FOR MULTIDIMENSIONAL PACKET CLASSIFICATION AN EFFICIENT HYBRID ALGORITHM FOR MULTIDIMENSIONAL PACKET CLASSIFICATION Yaxuan Qi 1 and Jun Li 1,2 1 Research Institute of Information Technology (RIIT), Tsinghua University, Beijing, China, 100084 2

More information

5. ReAl Systems on Silicon

5. ReAl Systems on Silicon THE REAL COMPUTER ARCHITECTURE PRELIMINARY DESCRIPTION 69 5. ReAl Systems on Silicon Programmable and application-specific integrated circuits This chapter illustrates how resource arrays can be incorporated

More information

A Multi Gigabit FPGA-based 5-tuple classification system

A Multi Gigabit FPGA-based 5-tuple classification system A Multi Gigabit FPGA-based 5-tuple classification system Antonis Nikitakis Technical University of Crete, Department of Electronic and Computer Engineering Kounoupidiana, Chania, Crete, GR73100, Greece

More information

Overview. Implementing Gigabit Routers with NetFPGA. Basic Architectural Components of an IP Router. Per-packet processing in an IP Router

Overview. Implementing Gigabit Routers with NetFPGA. Basic Architectural Components of an IP Router. Per-packet processing in an IP Router Overview Implementing Gigabit Routers with NetFPGA Prof. Sasu Tarkoma The NetFPGA is a low-cost platform for teaching networking hardware and router design, and a tool for networking researchers. The NetFPGA

More information

A Configurable Packet Classification Architecture for Software- Defined Networking

A Configurable Packet Classification Architecture for Software- Defined Networking A Configurable Packet Classification Architecture for Software- Defined Networking Guerra Pérez, K., Yang, X., Scott-Hayward, S., & Sezer, S. (2014). A Configurable Packet Classification Architecture for

More information

Parallel FIR Filters. Chapter 5

Parallel FIR Filters. Chapter 5 Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture

More information

High Throughput Sketch Based Online Heavy Change Detection on FPGA

High Throughput Sketch Based Online Heavy Change Detection on FPGA High Throughput Sketch Based Online Heavy Change Detection on FPGA Da Tong, Viktor Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, CA 90089, USA.

More information

Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs

Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs Xin Fang and Miriam Leeser Dept of Electrical and Computer Eng Northeastern University Boston, Massachusetts 02115

More information

Optimizing Packet Lookup in Time and Space on FPGA

Optimizing Packet Lookup in Time and Space on FPGA Optimizing Packet Lookup in Time and Space on FPGA Thilan Ganegedara, Viktor Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089 Email: {ganegeda,

More information

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore

More information

Ultra-Fast NoC Emulation on a Single FPGA

Ultra-Fast NoC Emulation on a Single FPGA The 25 th International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015 Ultra-Fast NoC Emulation on a Single FPGA Thiem Van Chu, Shimpei Sato, and Kenji Kise Tokyo

More information

Routing Lookup Algorithm for IPv6 using Hash Tables

Routing Lookup Algorithm for IPv6 using Hash Tables Routing Lookup Algorithm for IPv6 using Hash Tables Peter Korppoey, John Smith, Department of Electronics Engineering, New Mexico State University-Main Campus Abstract: After analyzing of existing routing

More information

Introduction to Field Programmable Gate Arrays

Introduction to Field Programmable Gate Arrays Introduction to Field Programmable Gate Arrays Lecture 1/3 CERN Accelerator School on Digital Signal Processing Sigtuna, Sweden, 31 May 9 June 2007 Javier Serrano, CERN AB-CO-HT Outline Historical introduction.

More information