An efficient sparse matrix format for accelerating regular expression matching on field-programmable gate arrays


SECURITY AND COMMUNICATION NETWORKS. Security Comm. Networks 2015; 8:13-24. Published online 10 May 2013 in Wiley Online Library (wileyonlinelibrary.com). Special issue paper.

Lei Jiang 1,3*, Jianlong Tan 2 and Qiu Tang 2
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
3 University of Chinese Academy of Sciences, Beijing, China

ABSTRACT

Regular expression matching is widely used in many programming languages and applications. A regular expression is typically transformed into a deterministic finite automaton (DFA) for processing. However, the DFA requires large memory resources because of the state blowup problem. Many algorithms have been proposed to compress the DFA storage, and they generally store the compressed DFA in a sparse matrix format. For field-programmable gate array (FPGA)-based implementations, operations on sparse matrices consume multiple clock cycles, reducing the flexibility and performance of applications. To accelerate regular expression matching, we present a compact sparse matrix format for storing the compressed DFA transition table on the FPGA. Taking advantage of the special properties of sparse matrices generated by DFAs, we can accomplish one access within a single clock cycle. Furthermore, we develop a regular expression matching engine on a Xilinx (Xilinx Inc., 2100 Logic Dr, San Jose, CA, USA) Virtex-6 FPGA chip using this sparse matrix format. Compared with previous solutions, this regular expression matching engine has more flexibility while keeping a high compression ratio. The results show that the engine saves 94% of memory space compared with the original DFA structure while keeping a fast matching speed. By running multiple engines in parallel, our design achieves a throughput of up to 29 Gbps. Copyright 2013 John Wiley & Sons, Ltd.

KEYWORDS
regular expression; DFA; sparse matrix; FPGA

*Correspondence
Lei Jiang, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. jianglei@ict.ac.cn

1. INTRODUCTION

Regular expressions provide a powerful and flexible method to match strings in text. Regular expression matching is widely used in many utilities and applications, such as text editors, programming languages, and network processing tools. Some languages, including Perl [1], Ruby [2], AWK [3], and Tcl [4], integrate regular expressions into the syntax of the core language itself. Other mainstream languages, including C/C++ [5,6] and Java [7], provide standard libraries for regular expression matching. Current network intrusion detection systems (NIDS), such as Bro [8] and Snort [9], widely use regular expressions to describe attack signatures because of their high expressiveness. For these network processing applications, regular expression matching is one of the biggest performance bottlenecks.

Many theories and algorithms [10-12] on regular expressions have been proposed since the 1960s. Typically, finite automata (FAs) are classified into deterministic finite automata (DFA) and nondeterministic finite automata (NFA). A DFA activates only one state transition for each input character, whereas an NFA may activate multiple transitions per character. Therefore, the DFA algorithm's searching complexity is O(1), providing a fast and stable matching speed.
For this reason, mainstream NIDS (Snort, Bro, etc.) prefer DFAs for regular expression matching. But as rule sets become increasingly complex and large, DFAs suffer from the state blowup problem. For example, the L7-filter [13] regular expression rule set consumes more than 16 GB of memory [14] when compiled by the normal DFA algorithm. It is therefore crucial to reduce the memory consumption of DFAs to satisfy complex and high-speed networking environments.

To this end, many algorithms have been proposed to eliminate the redundancies in the DFA transitions and compress the DFA storage, such as D2FA [15] and δFA [16,17]. Sparse matrices are ubiquitous in these algorithms for representing and storing the compressed DFA transition table, yielding significant savings in memory usage. Existing sparse matrix formats, such as compressed row storage [18], the jagged diagonal format [19], and the compressed diagonal storage format [20], achieve different levels of space efficiency and operation efficiency. However, for a regular expression matching application implemented in hardware, these formats decrease the performance and matching speed because each access to a sparse matrix element consumes multiple clock cycles.

In this paper, we present a novel architecture for sparse matrix storage on field-programmable gate arrays (FPGAs). Observing the sparse matrices generated from DFAs, we notice two interesting features. First, the number of rows is much larger than the number of columns; for example, the column size of the sparse matrix is always 256 for the extended ASCII alphabet. Second, most of the nonzero elements are located in specific columns. We adopt three techniques to implement our architecture: memory packing, an index table, and interleaved memories. Taking advantage of these techniques, our design accomplishes one access within a single clock cycle while keeping high space efficiency. Using the new sparse matrix architecture, we implement a regular expression matching engine inspired by the algorithm proposed by Y. Liu et al. [21], which compresses the DFA storage space by matrix decomposition. The architecture proposed in this paper has been targeted to a Xilinx Virtex-6 FPGA chip. The experimental results show that the proposed architecture achieves a throughput of nearly 30 Gbps and saves about 94% of memory space in the best case.

In summary, the main contributions of this paper are:
(1) We present a novel compact sparse matrix format that can efficiently store the compressed DFA transition table.
(2) On the basis of the new sparse matrix format, we design an FPGA-based architecture. By means of the parallelism of the FPGA, this architecture can accomplish one DFA state retrieval in a single clock cycle.
(3) We build a regular expression matching circuit based on a simple DFA compression algorithm on an FPGA board. We store the compressed transition table in our sparse matrix format, giving a flexible regular expression matching engine while saving more than 90% of memory space.

The rest of the paper is organized as follows. Section 2 presents the preliminary knowledge of our work, including regular expressions, finite automata, and sparse matrices, and then summarizes the related work in the literature. Section 3 presents our sparse matrix format and an architecture for sparse matrix storage. Section 4 introduces a regular expression matching engine using our sparse matrix architecture. Section 5 presents the experimental results. Finally, Section 6 concludes the paper.

2. RELATED WORKS

2.1. Introduction to FPGAs

The design in this paper is proposed and implemented on FPGA chips, so we first introduce some background on FPGAs. An FPGA is an integrated circuit designed to be configured by a customer or a designer after manufacturing, hence "field-programmable" [22]. Figure 1 shows the architecture of an FPGA, consisting of clustered logic blocks (CLBs), I/O cells, interconnection resources, and switch blocks.
CLBs are the basic programmable units of an FPGA, and they are connected via interconnection resources and switch blocks. After a new circuit design is burned into the FPGA, the CLBs are reconnected by configuring the interconnection resources.

Figure 1. Structure of field-programmable gate array.

Currently, graphics processing units (GPUs) [23] and multi-core central processing units [24] are two other popular parallel technologies. Compared with the FPGA, the GPU is more powerful but requires data transfers between main memory and the GPU, increasing latency. The multi-core central processing unit is easier to program and compile for, but its degree of parallelism and performance are much lower than those of the FPGA. In this paper, we take the FPGA as our implementation platform for better performance.

2.2. Regular expression grammar

Regular expressions are used for searching a text for strings containing a particular pattern. A regular expression is a pattern consisting of ASCII characters and some metacharacters. Unlike plain string patterns, regular expression patterns describe characteristics using metacharacters, so that a regular expression can describe a set of strings without enumerating them explicitly. Table I lists the common grammar of regular expressions. For example, consider the regular expression (a|b).*cd. This pattern matches any string that starts with ASCII character a or b, is followed by arbitrary characters, and ends with the ASCII letters cd.
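As a quick illustration, the following Python sketch (using the standard re module) checks a few made-up strings against the example pattern described above; the test strings are illustrative and do not come from the paper.

```python
import re

# The pattern from the example: starts with 'a' or 'b', then arbitrary
# characters, and ends with the letters "cd".
pattern = re.compile(r"(a|b).*cd")

print(bool(pattern.fullmatch("axyzcd")))   # True: starts with 'a', ends with "cd"
print(bool(pattern.fullmatch("bcd")))      # True: '.*' may match the empty string
print(bool(pattern.fullmatch("xacd")))     # False: does not start with 'a' or 'b'
```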

Table I. Grammar of regular expressions.
Metacharacter   Meaning                               Description
.               A single character wildcard           Matches any character
|               OR relationship                       Separates alternate possibilities
?               A quantifier denoting zero or one     Matches the preceding pattern element zero or one time
*               A quantifier denoting zero or more    Matches the preceding pattern element zero or more times
+               A quantifier denoting one or more     Matches the preceding pattern element one or more times
{M,N}           Repeat from M to N times              Denotes the minimum M and the maximum N match count
[]              A class of characters                 Denotes a set of possible character matches; [abc] denotes a letter a, b, or c

2.3. Finite automata

Finite automata are a natural formalism for regular expressions: every language defined by a regular expression is also defined by an FA, and it is a well-established fact that each regular expression can be transformed into an FA [25]. Two kinds of FA are mostly used in regular expression matching, DFA and NFA. In this section, we mainly discuss the DFA solution.

A DFA is commonly denoted as a 5-tuple (Q, Σ, δ, q0, Fin), where Q is a finite set of states, Σ is a finite set of input symbols (usually the ASCII alphabet of 256 characters), δ is a transition function, q0 is the start state, and Fin is a set of accepting states. The transition function δ describes how one DFA state transfers to another. Figure 2 shows the DFA corresponding to the regular expression (a|b).*cd. It is worth noting that in Figure 2 we omit many transitions: when the input character is not labeled in the figure, the next state is state 1. Obviously, there is a large number of transitions in the DFA. In theory, when a regular expression of length n is converted into a DFA, it may generate O(2^n) states, which means that in the worst case we may need on the order of 256 × 2^n transition entries to store the DFA, unacceptable for modern computer systems. This space problem is critical, especially when compiling multiple regular expressions into one composite DFA.

Figure 2. The DFA corresponding to the regular expression (a|b).*cd.
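To make the table-driven DFA lookup concrete, the Python sketch below hand-builds a small transition table for (a|b).*cd over the toy alphabet {a, b, c, d} and consumes one character per table lookup. The state names and the restriction to four symbols are illustrative choices; they do not correspond to the states of Figure 2.

```python
# A table-driven DFA recognizer for (a|b).*cd over the toy alphabet {a,b,c,d}.
DEAD = "dead"
ACCEPT = {"q3"}
TRANS = {
    "q0":  {"a": "q1", "b": "q1", "c": DEAD, "d": DEAD},   # expect leading a or b
    "q1":  {"a": "q1", "b": "q1", "c": "q2", "d": "q1"},   # inside ".*"
    "q2":  {"a": "q1", "b": "q1", "c": "q2", "d": "q3"},   # suffix "c" seen
    "q3":  {"a": "q1", "b": "q1", "c": "q2", "d": "q1"},   # suffix "cd" seen (accepting)
    DEAD:  {"a": DEAD, "b": DEAD, "c": DEAD, "d": DEAD},
}

def dfa_match(text):
    state = "q0"
    for ch in text:                 # exactly one table lookup per input character
        state = TRANS[state][ch]
    return state in ACCEPT

print(dfa_match("acd"), dfa_match("bddcd"), dfa_match("cda"))  # True True False
```

Each input character triggers exactly one lookup in the transition table, which is the O(1) per-character behaviour that makes DFAs attractive for NIDS; the price is the size of that table.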

2.4. Realization of NFA

State-of-the-art NIDS prefer regular expressions to describe attack signatures and process packets because of their powerful expressiveness. Traditionally, DFAs and NFAs are used to implement regular expression matching. The space complexity of an NFA is O(n) and its searching complexity is O(n^2), whereas a DFA's storage complexity is O(2^n) in the worst case and its searching complexity is O(1), where n is the length of the regular expression. Floyd and Ullman showed that an NFA regular expression circuit can be implemented efficiently using a programmable logic array architecture [26]. Sidhu et al. [27] and Clark et al. [28] showed that the NFA is an efficient method in terms of processing speed and area efficiency for implementing regular expressions on FPGAs. The conversion from an NFA to Sidhu's circuit is shown in Figure 3.

Figure 3. Sidhu's conversion algorithm: (a) single character, (b) union of N1 and N2, (c) concatenation of N1 and N2, and (d) repetition of zero or more (R*).

In Sidhu's circuit, each NFA state is implemented by a cascade of logic cells (LCs) of the FPGA. Thus, the consumed LC resources are proportional to the number of states. Meanwhile, the clock frequency of the designed circuit becomes lower, decreasing the performance of regular expression matching. For this reason, our matching circuit is based on the DFA method, and we mainly discuss DFA algorithms in the following subsection.

2.5. DFA algorithms

Compared with NFA-based algorithms, DFA-based algorithms usually consume large amounts of memory but provide high matching speed. For NIDS, the DFA algorithm is more appealing because of its deterministic behavior and high throughput. To overcome the limitations of the DFA, many algorithms have been proposed to compress the memory space and improve the performance of regular expression matching.

Fang Yu et al. [29] study and try to resolve the state blowup problem of DFAs. They find that memory requirements using traditional methods are prohibitively high for typical patterns used in real-world packet payload scanning applications. They then propose regular expression rewrite techniques to reduce memory usage and a grouping scheme that divides regular expression rule sets into several groups. However, their rule rewriting depends on the rule sets: new attack signatures may invalidate the rewritten DFAs, in which case new signature structures have to be studied.

Kumar et al. [15] observed that two states (S1 and S2) often have many identical next-state transitions (T) for a subset of input characters. Based on this observation, they proposed a new algorithm called D2FA to compress the transition table. D2FA eliminates S1's transitions (T) by introducing a default transition from S1 to S2. The experimental results show that a D2FA reduces transitions by more than 95% compared with the original DFA. However, D2FA's transition mechanism may look up memory multiple times per input character, leading to a higher memory bandwidth requirement.

On the basis of the observation that most adjacent states share a large part of identical transitions, Ficara et al. [16] present a new representation for the DFA, called delta finite automata (δFA). They record the transition set of the current state in a local memory and only store the differences between the current state and the next state. In this way, δFA achieves a very good compression effect. In addition, this algorithm requires only one state transition per character (keeping the characteristic of standard DFAs), thus allowing a fast string matching speed.

Qi et al. [30] proposed a new compression algorithm named FEACAN for the DFA transition table. FEACAN introduces a two-dimensional compression algorithm, utilizing the intrastate redundancy and interstate redundancy of the DFA to reduce memory consumption. The authors use a bitmap for the intrastate compression and a two-stage grouping algorithm for the interstate compression. In addition, other techniques are adopted to further improve the performance of FEACAN, such as input interleaving and a dual-lookup pipeline. Experimental results showed that FEACAN can achieve a throughput of 40 Gbps on FPGA chips. However, this architecture consumes four clock cycles to process one character, and this weakness limits the deployment of FEACAN.

T. Liu et al. [14] introduce a new compression algorithm that reduces the memory usage of DFAs stably without a significant impact on matching speed. They observe the distribution of transitions inside each state and find that more than 90% of transitions in DFAs go to the initial state or its near neighbors, which are called magic states by Becchi in [31].
Based on this observation, they divide all the transitions among three different matrices and compress these matrices. Experimental results show that this algorithm saves 95% of memory space with only a 40% loss of matching speed compared with the original DFA.

Y. Liu et al. [21] presented a new DFA matrix compression algorithm named column row decomposition (CRD). This algorithm decomposes the DFA transition table into a column vector, a row vector, and a sparse matrix to reduce the storage space as much as possible. Experiments on typical rule sets show that the proposed method significantly reduces memory usage while still running at a fast searching speed.

The algorithms mentioned above focus on eliminating the redundancies of the DFA transition table and generally store the compressed transition table in a sparse matrix format. One widely used sparse matrix format is compressed row storage (CRS) [18]. The sparse matrix is decomposed into three vectors: a row vector VR, a column vector VC, and a values vector VV. The VV vector stores the values of the nonzero elements of the sparse matrix. The column vector VC contains the original column positions of the corresponding elements in VV. The row vector VR contains, for each row, the position in VC of the row's first nonzero element. Another scheme for storing the sparse matrix is block compressed row storage (BCRS) [32], which divides a sparse matrix into multiple blocks. Similar to the CRS format, three arrays are required for BCRS: a rectangular array that stores the nonzero blocks in (block) row-wise fashion, an integer array that stores the actual column indices of the nonzero blocks in the sparse matrix, and a pointer array whose entries point to the beginning of each block row. The storage savings of BCRS and CRS can be significant for general sparse matrices, but they do not take into account the particular structure of the sparse matrices generated by DFA compression algorithms.

2.6. DFA compression and sparse matrix

Many works have been presented to solve the space problem of the DFA by compressing the DFA transition table [14-16,21,30]. The key point of these compression techniques is to eliminate the redundancies of the DFA transition table, and the sparse matrix is a very useful tool for storing the compressed transition table. In the following paragraphs, we explain the technical details.

Figure 4. Example of a deterministic finite automata compression algorithm.
Figure 5. A typical sparse matrix.

The state transition table of a DFA can be considered as an m × n matrix A, where m is the number of states and n is the alphabet size, so the matrix contains m × n elements. Each element A[i, j] gives the state reached from state i through the input character labeled j. DFA compression algorithms focus on eliminating the redundancies of the transition matrix and store the compressed transition table in a sparse matrix format. Figure 4 is an example of a DFA compression algorithm. The left part of Figure 4 depicts the transition matrix corresponding to a DFA on the alphabet {a, b, c, d} that recognizes the regular expressions (a+), (b+c), and (cd+), and the right part is the sparse matrix after eliminating identical transitions. Obviously, the key problem for the compression algorithm is the storage and access of the sparse matrix. However, previous sparse matrix formats usually take a software-based approach without considering the hardware situation. In the following section, we develop a compact sparse matrix format with highly efficient access, based on the parallelism of the FPGA.

3. SPARSE MATRIX FORMAT DESIGN

In this section, we present a novel sparse matrix format for efficient storage on the FPGA using three techniques: memory packing, an index table, and interleaved memories. We then illustrate the technical details.

3.1. Sparse matrix format

The storage of the sparse matrix is the key problem for the whole matching circuit. A sparse matrix contains very few nonzero elements, usually less than 1%. Obviously, there is no need to store a large number of zeros, and many methods have been presented that exploit the sparse structure of the matrix. For example, a typical sparse matrix R is shown in the left part of Figure 5. The sparse matrix R has seven nonzero elements, whereas the total number of elements is 30, so there is a lot of redundancy. If the zeroes of the sparse matrix are eliminated, we can save a lot of memory space. In fact, many sparse matrix storage schemes exist, such as CRS [18], the linked list [33], and the jagged diagonal format [32]. An example of the CRS format is shown in Figure 6, where we continue to take the matrix in Figure 5 as the example. The matrix is split into three vectors: a row vector Vec_row, a column vector Vec_col, and a values vector Vec_val. The vector Vec_val contains all the values of the nonzero elements of the matrix. The vector Vec_col, of length equal to Vec_val, contains the original column positions of the elements in Vec_val. Each element of Vec_row points to the first element of the corresponding row in Vec_val and Vec_col. This method is rather straightforward. Another efficient way to store the sparse matrix is the block compressed row storage (BCRS) format [32]. The main idea of this method is to split the matrix into several blocks. Each block stores its nonzero matrix elements in a single vector, including their values and position information, and a CRS scheme is then performed on the block arrays. However, both schemes take multiple queries to look up one element [34], which is inefficient for hardware-based DFA storage and access.

Figure 6. Compressed row storage format.
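For concreteness, the Python sketch below encodes a small sparse matrix in the Vec_val / Vec_col / Vec_row layout just described and shows why reading a single element requires scanning a row slice, i.e., several dependent memory accesses. The matrix values are made up for illustration; Figure 5 itself is not reproduced here.

```python
# Compressed row storage (CRS) of a small 5x6 sparse matrix with 7 nonzeros.
M = [
    [0, 0, 7, 0, 0, 0],
    [0, 3, 0, 0, 4, 1],
    [0, 0, 0, 0, 0, 0],
    [5, 0, 0, 2, 0, 0],
    [0, 0, 0, 0, 9, 0],
]

Vec_val, Vec_col, Vec_row = [], [], [0]
for row in M:
    for j, v in enumerate(row):
        if v != 0:
            Vec_val.append(v)          # values of the nonzero elements
            Vec_col.append(j)          # their original column positions
    Vec_row.append(len(Vec_val))       # start of the next row in Vec_val/Vec_col

def crs_get(i, j):
    # Reading element (i, j) scans the slice belonging to row i: several
    # dependent memory accesses, which is what the format of Section 3 avoids.
    for k in range(Vec_row[i], Vec_row[i + 1]):
        if Vec_col[k] == j:
            return Vec_val[k]
    return 0

assert crs_get(3, 3) == 2 and crs_get(2, 5) == 0
```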
3.2. Sparse matrix storage and access

In this section, we use a memory packing scheme to compress the sparse matrix: we join all the nonzero elements together and ignore all the zeroes in the matrix, as shown in the right part of Figure 7. We exploit three techniques in our sparse matrix architecture: memory packing, an index table, and interleaved memories.

Figure 7. The compact sparse matrix storage structure.
Figure 8. Index table for sparse matrix R.
Figure 9. Interleaved memories for sparse matrix storage.

To explain our method more clearly, we use an example to illustrate these techniques. The memory packing technique is shown in the right part of Figure 7. This scheme drops all of the zeros of the sparse matrix, achieving a considerable compression ratio, but it introduces another problem: we lose the position information of the matrix elements. To address this issue, we develop a novel access scheme that accomplishes the access operation in a single clock cycle, which is very important for improving the throughput of FPGA-based regular expression matching engines.

The first technique is an index table, which indicates the start position of the matrix elements of each row. The embedded memory of current FPGA chips can read multiple bits (which we call one word in this paper) in one clock cycle, and the word width (how many bits a word contains) can be configured manually. Utilizing this feature, we can choose how many matrix elements one word contains according to the specific conditions. The right part of Figure 8 shows the matrix R of Figure 6 packed into a memory layout with two matrix elements per word. The left part of Figure 8 is the structure of the index table: ROWID is the row number of the sparse matrix, INDEX is the index of the first word to read in the memory, OFFSET is the location of the row's first element within that word, and #NZ is the number of nonzero elements in the row. Once the index table entry is retrieved by ROWID, the word at INDEX is read, and #NZ consecutive elements are taken starting at position OFFSET. For example, when a query with ROWID = 1 arrives, the index table entry gives INDEX = 0, OFFSET = 1, and #NZ = 2, so word 0 is read and the two consecutive elements starting at position 1 are returned as the query result.

Note that when ROWID = 4, the nonzero elements of the row are distributed over two different words. In this case, we would have to read two consecutive words from memory, consuming two clock cycles and significantly decreasing throughput. Using the parallel processing ability of the FPGA, we adopt interleaved memories to deal with this problem. Modern FPGAs provide hundreds of on-chip memory banks, which can be read or written concurrently [35]. On the basis of this feature, we store the sparse matrix in multiple on-chip memory banks. If the nonzero elements to be read span M words, we distribute these elements over M parallel memory banks. M can be calculated as M = ⌈NZ_max / w⌉, where w is the number of matrix elements one word contains and NZ_max is the largest number of nonzero elements in any row of the sparse matrix. The interleaved memory technique is shown in Figure 9: we split the original memory into two memory banks, where one bank stores the odd-numbered words and the other stores the even-numbered words. By this means, we are able to read any two consecutive words in a single read cycle. For example, suppose an access needs to read an element of S3 that spans two consecutive words, word 1 and word 2. Using the interleaved memories, word 1 and word 2 can be read from the two different memory banks simultaneously, consuming only one clock cycle.
Compared with the original memory layout, this architecture saves one clock cycle, roughly doubling the performance. Because the memory architecture is modified, we also adjust the index table design to accommodate the new structure. When executing an access operation, the bank in which the starting word is located must be known in advance, so we add a one-bit flag to each index table entry to record this information. Figure 10 shows the modified index table structure. Bank0 and Bank1 in Figure 10 correspond to the two memory banks in Figure 9, respectively. The FLAG field indicates in which bank the starting word is located: FLAG = 0 means that the starting word is in Bank0, whereas FLAG = 1 means that it is in Bank1. Assume that a query with ROWID = 1 arrives to retrieve the corresponding sparse matrix elements. By inquiring index table entry 1, we find FLAG = 0, INDEX = 0, OFFSET = 1, and #NZ = 2. Then, according to FLAG and INDEX, we go to Bank0 to read

the starting word 0. Because OFFSET + #NZ is greater than w (recall that w is the number of matrix elements one word contains), the sparse matrix elements of row 1 span two different memory banks. We therefore also read word 0 of Bank1 to get the remaining elements and combine the two words into one wide word of width 2w. From the wide word, we take out #NZ elements starting at position OFFSET. This process is shown in Figure 11. By this means, we manage to access the sparse matrix within one clock cycle.

Figure 10. Modified index table structure.
Figure 11. Combining two words from multiple memory banks into a wide word.

3.3. Other details

In our prototype implementation, we use two memory banks. This is because in our experiments NZ_max (recall that NZ_max is the largest number of nonzero elements in one row) is 73, and two banks are enough to hold 73 nonzero elements. If the rule set is more complex, NZ_max is likely to become larger, and two memory banks may not be enough to hold all the elements. This problem can be solved by adding more memory banks to our design. Because NZ_max can never exceed 256 (the size of the extended ASCII alphabet), the calculation shows that four memory banks are sufficient for any regular expression rule set.

Most DFA-based matching algorithms can be abstracted to the model in Figure 12. The DFA engine looks up the transition table to get the next state and then feeds the output state back as the input state for the next clock cycle. This process introduces a feedback loop into the circuit. Because of this feedback, if the DFA engine needs n clock cycles to process one input character, the engine must halt for n - 1 cycles to wait for the next state. Our architecture avoids this pipeline stall by processing one input character within one clock cycle.

Figure 12. Model of deterministic finite automata (DFA) engine.
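The Python sketch below models the complete lookup path described in this section: nonzero (column, value) pairs are packed into fixed-width words, the words are interleaved over two banks, and the index table entry (FLAG, INDEX, OFFSET, #NZ) selects the one or two words that are read in parallel and sliced into the row's nonzeros. The word width, bank count, and example data are toy values chosen for illustration; the real design uses wider words stored in on-chip block RAM.

```python
W = 2  # matrix elements per memory word (toy value; the hardware word is wider)

def pack_rows(rows):
    """Pack each row's nonzero (col, val) pairs into one flat element stream,
    split the stream into W-element words, and interleave the words over two
    banks (Bank0 = even-numbered words, Bank1 = odd-numbered words). Also
    build the index table entry (FLAG, INDEX, OFFSET, #NZ) for every row."""
    elements, index_table = [], []
    for nz in rows:                              # nz = list of (col, val) pairs
        word, off = divmod(len(elements), W)     # global word number, offset inside it
        index_table.append({"FLAG": word % 2, "INDEX": word // 2,
                            "OFFSET": off, "NZ": len(nz)})
        elements.extend(nz)
    words = [elements[i:i + W] + [None] * (W - len(elements[i:i + W]))
             for i in range(0, len(elements), W)]
    return index_table, words[0::2], words[1::2]   # index table, Bank0, Bank1

def read_row(row_id, index_table, bank0, bank1):
    """One 'cycle': read a word from each bank in parallel, concatenate them
    into a wide word of width 2W, and slice out #NZ elements at OFFSET."""
    e = index_table[row_id]
    if e["FLAG"] == 0:        # starting word in Bank0, successor word in Bank1
        wide = bank0[e["INDEX"]] + (bank1[e["INDEX"]] if e["INDEX"] < len(bank1) else [None] * W)
    else:                     # starting word in Bank1, successor word in Bank0
        wide = bank1[e["INDEX"]] + (bank0[e["INDEX"] + 1] if e["INDEX"] + 1 < len(bank0) else [None] * W)
    return wide[e["OFFSET"]: e["OFFSET"] + e["NZ"]]

rows = [[(0, 3)], [(1, 5), (3, 7)], [(2, 4)], [(0, 1)], [(1, 2), (2, 6)]]
table, bank0, bank1 = pack_rows(rows)
print(read_row(1, table, bank0, bank1))   # [(1, 5), (3, 7)]: spans two words, one read
```

In hardware, the two bank reads, the concatenation into the 2w-wide word, and the shift by OFFSET are all combinational within the same clock cycle, which is what gives the single-cycle access.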
4. HARDWARE IMPLEMENTATION FOR REGULAR EXPRESSION MATCHING

In the previous section, we proposed a sparse matrix architecture and introduced its technical details. In this section, we implement a regular expression matching engine using this architecture. Inspired by the DFA decomposition algorithm proposed by Y. Liu [21], we develop a regular expression matching engine that adopts this architecture on FPGA chips, and we describe its architecture below.

4.1. Main idea of the DFA decomposition algorithm

Y. Liu presented a software-based DFA matrix compression algorithm named column row decomposition (CRD). As shown in Figure 13, the basic idea of CRD is to decompose the DFA transition table (matrix A) into a column vector X of size m, a row vector Y of size n, and a sparse matrix R (which can be stored in little space), so as to reduce the storage space. When accessing a matrix element, A[i, j] is calculated as A[i, j] = X[i] + Y[j] + R[i, j], where A[i, j] is the element of matrix A, X[i] is the ith element of vector X, Y[j] is the jth element of vector Y, and R[i, j] is the corresponding element of sparse matrix R. This DFA compression algorithm is simple and well suited to hardware implementation. In [21], Liu stores the nonzero elements of each row in a sorted array and accesses an element by binary search. This method is practical for software on general-purpose processors, but it is inappropriate for hardware implementation.

Figure 13. Decomposition of the deterministic finite automata transition matrix.
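As a small numerical illustration of the decomposition A[i, j] = X[i] + Y[j] + R[i, j], the Python sketch below builds X, Y, and R for a made-up 4 x 4 transition table. The construction used here (each column's most frequent value goes into Y, X is fixed at zero) is only one simple way to obtain such a decomposition, not necessarily the CRD procedure of [21]; it merely shows that R becomes sparse when the columns of A are nearly constant.

```python
# A hypothetical 4-state x 4-symbol transition table (values made up; this is
# not the matrix of Figure 13).
A = [[1, 1, 2, 1],
     [1, 1, 2, 1],
     [1, 1, 2, 3],
     [1, 1, 2, 1]]

m, n = len(A), len(A[0])
X = [0] * m                                         # trivial choice of the column vector
columns = ([A[i][j] for i in range(m)] for j in range(n))
Y = [max(set(col), key=col.count) for col in columns]   # most frequent value per column
R = [[A[i][j] - X[i] - Y[j] for j in range(n)] for i in range(m)]   # residuals

# The dense table is recovered element by element; only the nonzero residuals
# of R need to be stored explicitly.
assert all(A[i][j] == X[i] + Y[j] + R[i][j] for i in range(m) for j in range(n))
print(sum(v != 0 for row in R for v in row), "nonzero elements in R out of", m * n)
```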

4.2. Regular expression matching architecture

The overall structure of our regular expression matching engine is shown in Figure 14. The component in the top dashed box is the regular expression compiler, and the component in the bottom dashed box is the matching circuit. After compiling the regular expression rules into a DFA, the compiler transforms the DFA transition matrix into a column vector X, a row vector Y, and a sparse matrix R, and then writes the result to the embedded memory of the FPGA chip. The matching process is as follows: the current state pointer and the input character are combined to determine whether a nonzero element of the sparse matrix is hit. If it is hit, the next state equals X[i] + Y[j] + R[i, j]; otherwise, the next state equals X[i] + Y[j].

Figure 14. Overall structure of the regular expression matching engine.

Figure 15 depicts the architecture of the matching circuit in detail. The architecture consists of a module that looks up vector X (we call it C_VECTOR, because this lookup is indexed by the input character), a module that looks up vector Y (we call it S_VECTOR, because this lookup is indexed by the current state), and a module that looks up the sparse matrix R. EVEN_CM and ODD_CM correspond to Bank0 and Bank1 in Figure 10, respectively; as discussed previously, the nonzero elements are stored in these two memory banks. Assuming the current state is s, when a new character c arrives we obtain the values X[c] and Y[s] by querying C_VECTOR and S_VECTOR, respectively. By querying the index table and the memory banks, we obtain all the nonzero elements located in the sth row of the sparse matrix R. However, to find the exact element we need additional information to determine whether the input character hits the sparse matrix and at which location to take the correct element value. We therefore redesign the form of the elements stored in the memories by replacing <R[i, j]> with <R[i, j], y>, where R[i, j] is the element value of the sparse matrix R and y is the corresponding column index. The input character is compared with y to determine whether the input character c hits the sparse matrix element. As shown in Figure 15, the hit wire connects to the SEL port of the multiplexer (Mux). If the sparse matrix is hit, the corresponding sparse matrix element value R[i, j] is available simultaneously, and the next state equals X[c] + Y[s] + R[i, j]. If the sparse matrix is not hit, the next state equals X[c] + Y[s].

Figure 15. Detailed structure of the regular expression matching circuit.
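A behavioural Python sketch of the per-character datapath of Figure 15 is given below (a software model, not the Verilog): C_VECTOR is indexed by the input character, S_VECTOR by the current state, and each row of R is held as the <value, column> pairs delivered by the memory banks; comparing the stored column index with the input character plays the role of the hit signal that drives the multiplexer.

```python
def next_state(state, char, C_VECTOR, S_VECTOR, R_rows):
    """Compute the next DFA state for one input character.
    R_rows[state] holds the (column, value) pairs of that state's row, as
    returned by the index table and the two memory banks."""
    base = C_VECTOR[char] + S_VECTOR[state]
    for col, val in R_rows[state]:        # compare stored column index with the input
        if col == char:                   # hit: the comparator asserts the Mux SEL signal
            return base + val             # X[c] + Y[s] + R element
    return base                           # miss: X[c] + Y[s]

def match(text, start_state, C_VECTOR, S_VECTOR, R_rows, accepting):
    state = start_state
    for ch in text:                       # one lookup, hence one clock cycle, per character
        state = next_state(state, ord(ch), C_VECTOR, S_VECTOR, R_rows)
    return state in accepting
```

In the circuit, the comparisons against all nonzeros of the row happen in parallel rather than in a loop, so the hit decision and the three-way sum fit in the same clock cycle as the memory reads.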

5. PERFORMANCE EVALUATION

We evaluate our regular expression matching engine from two aspects: storage efficiency and throughput. First, we study the memory usage and compression ratio using some real-life regular expression rule sets. Then, we implement a prototype design on a Xilinx Virtex-6 FPGA chip and investigate the matching speed and memory consumption. Finally, we compare and analyze the experimental results against several previous works.

5.1. Testbench

We select four sets of regular expression rules from Snort and Bro: snort24.re (with 24 rules from Snort), snort31.re (with 31 rules from Snort), snort34.re (with 34 rules from Snort), and bro217.re (with 217 rules from Bro). We implement the prototype on a Xilinx Virtex-6 FPGA chip (XC6VSX475T, with 7640 Kb of distributed RAM in addition to its logic cells and block RAM).

5.2. Memory usage evaluation

In our architecture, the index table is stored in distributed memory (implemented using LUTs and REGs of the FPGA), whereas C_VECTOR, S_VECTOR, EVEN_CM, and ODD_CM are stored in embedded memory, so the minimum memory consumption S_min can be calculated as

S_min = N_c ⌈lg N_c⌉ + N_s ⌈lg N_s⌉ + NZ (⌈lg N_c⌉ + ⌈lg N_s⌉),

where N_c is the size of the input character set, N_s is the number of states, and NZ is the number of nonzero elements in the sparse matrix. If not compressed, a baseline DFA engine has to store the whole transition table, so the compression ratio r can be calculated as

r = 1 - S_min / (N_c N_s ⌈lg N_s⌉).

Table II. Memory usage results (rule sets: bro217, snort24, snort31, snort34; rows: No. of states; No. of NZ; Baseline DFA memory size (MB); Memory size of our work (MB); Comp. ratio (%)). DFA, deterministic finite automata.
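The following Python lines evaluate these formulas for hypothetical parameters (256 input characters, 3000 states, 20000 nonzeros); these are illustrative numbers, not the per-rule-set values reported in Table II, and r is taken as the fraction of memory saved relative to the uncompressed table.

```python
from math import ceil, log2

# Hypothetical parameters, chosen only to exercise the formulas above.
Nc, Ns, NZ = 256, 3000, 20000

S_min = Nc * ceil(log2(Nc)) + Ns * ceil(log2(Ns)) + NZ * (ceil(log2(Nc)) + ceil(log2(Ns)))
S_dfa = Nc * Ns * ceil(log2(Ns))               # baseline: full transition table
r = 1 - S_min / S_dfa                          # fraction of memory saved

print(f"S_min = {S_min} bits, baseline = {S_dfa} bits, compression ratio = {r:.1%}")
```

With these numbers the compressed storage comes to roughly 5% of the baseline, i.e., a compression ratio of about 95%, in the same range as the measurements reported below.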

The experimental results for memory usage are listed in Table II. From Table II, we conclude that the compression ratio of our architecture is more than 90% on average, and the compression ratio on bro217 is the best, about 94%. Another observation is that the compression ratio is related not only to the number of states but also to the number of nonzero elements of the sparse matrix: although snort24 has far fewer states than snort34, its larger number of nonzero elements still leads to a much lower compression ratio.

5.3. Speed evaluation

We write our regular expression matching engine in Verilog and implement the prototype on one Virtex-6 FPGA chip, using two memory banks to store the sparse matrix elements. We simulate the design with the ModelSim simulator and synthesize it with the ISE/Synplify tool chain. Because our design processes one input character per clock cycle, we can derive the throughput exactly from the clock frequency. In addition, we also list the LUT and register consumption because these factors also affect the frequency to some extent. Table III lists the throughput results of our experiments.

Table III. Frequency and throughput results (rule sets: bro217, snort24, snort31, snort34; rows: No. of states; LUT; REG; fmax (MHz); Throughput (Mbps)).

In theory, more DFA states and more nonzero elements increase the LUT resource consumption and consequently decrease the clock frequency. However, the experimental results show that when the number of states increases from 5389 for snort31 to the larger count of snort34, the maximum frequency decreases only slightly. This result shows that our architecture keeps a steady throughput under different rule sets, which is of great importance for NIDS. We achieve scalable performance by running multiple regular expression matching engines in parallel on the FPGA. The number of parallel engines is mainly determined by the size of the on-chip memories of the FPGA chip. More experimental results are shown in Table V.

5.4. Comparison with other implementations

We compare the compression ratio and throughput of our design with other implementations. Table IV lists the compression ratios and clock cycles per input character of the different methods.

Table IV. Compression ratio compared with other implementations.
Method        Compression ratio   Clock cycles per input
DFA           0                   1
DPICO [35]    65%                 1
CPDFA [36]    90%                 >= 2
FEACAN [30]   90%                 4
Our work      90%                 1
DFA, deterministic finite automata.

In particular, we compare the throughput of our design with FEACAN [30] using the same rule sets and the same FPGA chip. The results are shown in Table V.

Table V. Throughput compared with FEACAN (rows: bro217 FEACAN, bro217 our work, snort24 FEACAN, snort24 our work, snort31 our work, snort34 our work; columns: Method, No. of engines, fmax (MHz), Throughput (Gbps), Throughput per clock (Gbps)).

From Table V, our clock frequency and throughput are lower than FEACAN's, but FEACAN consumes four clock cycles to process one input character, so its throughput per clock is four times lower. Because the design based on our sparse matrix format processes one input character in a single clock cycle, the throughput of our implementation per clock cycle is higher than that of FEACAN. If new techniques such as the dual-port SRAM presented in [30] were used, the speed could in theory be doubled, which means we would obtain a throughput of about 57 Gbps on the bro217 and snort31 rule sets.
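The relation between clock frequency, cycles per character, and throughput used in this comparison can be written down directly; the sketch below evaluates it for placeholder frequencies and engine counts (not the measured values of Table V), assuming one 8-bit character is consumed per lookup.

```python
# Back-of-the-envelope throughput model: one 8-bit character per lookup,
# divided by the number of clock cycles each lookup takes.
def throughput_gbps(f_mhz, cycles_per_char, engines=1):
    return f_mhz * 1e6 * 8 / cycles_per_char * engines / 1e9

print(throughput_gbps(200, 1))              # single engine, 1 cycle/char  -> 1.6 Gbps
print(throughput_gbps(200, 4))              # single engine, 4 cycles/char -> 0.4 Gbps
print(throughput_gbps(200, 1, engines=16))  # 16 parallel engines          -> 25.6 Gbps
```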
6. CONCLUSION AND FUTURE WORK

In this paper, we focus on solving the storage problem of the compressed DFA transition matrix on FPGAs. We present a new architecture for sparse matrix storage and access. Our architecture takes advantage of the special properties of sparse matrices generated by DFAs, significantly improving the flexibility and efficiency of FPGA-based applications. In this architecture, we adopt several techniques to reduce the memory space, including memory packing, an index table, and interleaved memory banks. We then build a regular expression matching engine on one Xilinx Virtex-6 FPGA chip and evaluate it with four groups of real-life regular expression rule sets. The results show that our design saves 90% of memory space on average and 94% in the best case. We then run multiple engines in parallel on the FPGA and

achieve a throughput of 7 Gbps using the snort24 rule set and 29 Gbps using the snort31 rule set. Because we can accomplish one lookup per clock cycle, our regular expression matching engine has more flexibility than previous solutions while keeping a high compression ratio. It should be emphasized that the sparse matrix storage architecture can be used with various DFA compression algorithms; for simplicity, we implemented it in this paper with the DFA transition matrix decomposition algorithm. The experimental results prove the feasibility and efficiency of our method. In the future, we will apply our design to multiple DFA compression algorithms.

ACKNOWLEDGEMENTS

This work has been partially funded by the National High-Tech Research and Development Plan (863) of China, under grants 2011AA and 012AA012502; the National Natural Science Foundation of China (NSFC); and the Special Pilot Research of the Chinese Academy of Sciences under grant XDA.

REFERENCES

1. Wall L, et al. The Perl programming language.
2. Flanagan D, Matsumoto Y. The Ruby Programming Language. O'Reilly Media: Sebastopol, California.
3. Robbins AD. The GNU Awk user's guide.
4. Ousterhout JK, Jones K. Tcl and the Tk Toolkit. Addison-Wesley: Boston, Massachusetts.
5. Kernighan BW, Ritchie D, Lippman SB, Lajoie J. C Programming Language. 2009, in press.
6. Stroustrup B. The C++ Programming Language, Vol. 3. Addison-Wesley: Boston, Massachusetts.
7. Gosling J, Joy B, Steele G, Bracha G. The Java (TM) Language Specification. Addison-Wesley Professional: Boston, Massachusetts.
8. Paxson V. Bro: a system for detecting network intruders in real-time. Computer Networks 1999; 31(23-24).
9. Roesch M, et al. Snort: lightweight intrusion detection for networks. Proceedings of the 13th USENIX Conference on System Administration, Seattle, Washington, 1999.
10. Baeza-Yates RA, Gonnet GH. Fast text searching for regular expressions or automaton searching on tries. Journal of the ACM (JACM) 1996; 43(6).
11. Myers G. A four Russians algorithm for regular expression pattern matching. Journal of the ACM (JACM) 1992; 39(2).
12. Thompson K. Programming techniques: regular expression search algorithm. Communications of the ACM 1968; 11(6).
13. Levandoski J, Sommer E, Strait M, et al. Application layer packet classifier for Linux.
14. Liu T, Yang Y, Liu Y, Sun Y, Guo L. An efficient regular expressions compression algorithm from a new perspective. 2011 Proceedings IEEE INFOCOM, IEEE, Shanghai, 2011.
15. Kumar S, Dharmapurikar S, Yu F, Crowley P, Turner J. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. ACM SIGCOMM Computer Communication Review 2006; 36(4).
16. Ficara D, Giordano S, Procissi G, Vitucci F, Antichi G, Di Pietro A. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review 2008; 38(5).
17. Ficara D, Di Pietro A, Giordano S, Procissi G, Vitucci F, Antichi G. Differential encoding of DFAs for fast regular expression matching. IEEE/ACM Transactions on Networking (TON) 2011; 19(3).
18. Bai Z. Templates for the Solution of Algebraic Eigenvalue Problems, Vol. 11. Society for Industrial Mathematics: Philadelphia, PA.
19. Saad Y. Krylov subspace methods on supercomputers. SIAM Journal on Scientific and Statistical Computing 1989; 10(6).
20. Dongarra J. Sparse matrix storage formats. In Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM, 2000; 11.
21. Liu Y, Guo L, Liu P, Tan J. Compressing regular expressions' DFA table by matrix decomposition. Implementation and Application of Automata 2011; 6482.
22. Wikipedia. Field-programmable gate array. Wikipedia, the free encyclopedia, wikipedia.org/wiki/Field-programmable_gate_array.
23. Nickolls J, Buck I, Garland M, Skadron K. Scalable parallel programming with CUDA. Queue 2008; 6(2).
24. Wikipedia. Multi-core processor. Wikipedia, the free encyclopedia, wikipedia.org/wiki/Multi-core_processor.
25. Brüggemann-Klein A. Regular expressions into finite automata. Theoretical Computer Science 1993; 120(2).
26. Floyd RW, Ullman JD. The compilation of regular expressions into integrated circuits. 21st Annual Symposium on Foundations of Computer Science, Syracuse, New York, 1980.
27. Sidhu R, Prasanna VK. Fast regular expression matching using FPGAs. IEEE Symposium on Field-Programmable Custom Computing Machines. IEEE: Rohnert Park, California, 2001.
28. Clark CR, Schimmel DE. Efficient reconfigurable logic circuits for matching complex network intrusion detection patterns. Proceedings of the 13th International Conference on Field Programmable Logic and Applications. IEEE: Lisbon, Portugal, 2003.
29. Yu F, Chen Z, Diao Y, Lakshman TV, Katz RH. Fast and memory-efficient regular expression matching for deep packet inspection. ANCS 2006, ACM/IEEE Symposium on Architecture for Networking and Communications Systems. IEEE: San Jose, California, 2006.
30. Qi Y, Wang K, Fong J, Xue Y, Li J, Jiang W, Prasanna V. FEACAN: front-end acceleration for content-aware network processing. 2011 Proceedings IEEE INFOCOM. IEEE: Shanghai, 2011.
31. Becchi M, Crowley P. An improved algorithm to accelerate regular expression evaluation. Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems. ACM: Orlando, Florida, 2007.
32. Vassiliadis S, Cotofana S, Stathis P. Block based compression storage expected performance. Kluwer International Series in Engineering and Computer Science 2002; 657.
33. Hu J, Wang W. Algorithm research for vector linked-list sparse matrix multiplication. 2010 Asia-Pacific Conference on Wearable Computing Systems (APWCS). IEEE: Kaohsiung, Taiwan, 2010.
34. Smailbegovic F, Gaydadjiev GN, Vassiliadis S. Sparse matrix storage format. Proceedings of the 16th Annual Workshop on Circuits, Systems and Signal Processing, Veldhoven, The Netherlands, 2005.
35. Hayes CL, Luo Y. DPICO: a high speed deep packet inspection engine using compact finite automata. Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems. ACM: Orlando, Florida, 2007.
36. Lin W, Tang Y, Liu B, Pao D, Wang X. Compact DFA structure for multiple regular expressions matching. ICC '09, IEEE International Conference on Communications. IEEE: Dresden, Germany, 2009.


More information

Hardware-accelerated regular expression matching with overlap handling on IBM PowerEN processor

Hardware-accelerated regular expression matching with overlap handling on IBM PowerEN processor Kubilay Atasu IBM Research Zurich 23 May 2013 Hardware-accelerated regular expression matching with overlap handling on IBM PowerEN processor Kubilay Atasu, Florian Doerfler, Jan van Lunteren, and Christoph

More information

Tree-Based Minimization of TCAM Entries for Packet Classification

Tree-Based Minimization of TCAM Entries for Packet Classification Tree-Based Minimization of TCAM Entries for Packet Classification YanSunandMinSikKim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington 99164-2752, U.S.A.

More information

FPGA Matrix Multiplier

FPGA Matrix Multiplier FPGA Matrix Multiplier In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri

More information

Implementation of Lexical Analysis. Lecture 4

Implementation of Lexical Analysis. Lecture 4 Implementation of Lexical Analysis Lecture 4 1 Tips on Building Large Systems KISS (Keep It Simple, Stupid!) Don t optimize prematurely Design systems that can be tested It is easier to modify a working

More information

Analysis of Basic Data Reordering Techniques

Analysis of Basic Data Reordering Techniques Analysis of Basic Data Reordering Techniques Tan Apaydin 1, Ali Şaman Tosun 2, and Hakan Ferhatosmanoglu 1 1 The Ohio State University, Computer Science and Engineering apaydin,hakan@cse.ohio-state.edu

More information

ReCPU: a Parallel and Pipelined Architecture for Regular Expression Matching

ReCPU: a Parallel and Pipelined Architecture for Regular Expression Matching ReCPU: a Parallel and Pipelined Architecture for Regular Expression Matching Marco Paolieri, Ivano Bonesana ALaRI, Faculty of Informatics University of Lugano, Lugano, Switzerland {paolierm, bonesani}@alari.ch

More information

Implementation of Lifting-Based Two Dimensional Discrete Wavelet Transform on FPGA Using Pipeline Architecture

Implementation of Lifting-Based Two Dimensional Discrete Wavelet Transform on FPGA Using Pipeline Architecture International Journal of Computer Trends and Technology (IJCTT) volume 5 number 5 Nov 2013 Implementation of Lifting-Based Two Dimensional Discrete Wavelet Transform on FPGA Using Pipeline Architecture

More information

VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH THE EFFICIENT MULTIPLICATIVE INVERSE UNIT

VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH THE EFFICIENT MULTIPLICATIVE INVERSE UNIT VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH THE EFFICIENT MULTIPLICATIVE INVERSE UNIT K.Sandyarani 1 and P. Nirmal Kumar 2 1 Research Scholar, Department of ECE, Sathyabama

More information

Low Complexity Opportunistic Decoder for Network Coding

Low Complexity Opportunistic Decoder for Network Coding Low Complexity Opportunistic Decoder for Network Coding Bei Yin, Michael Wu, Guohui Wang, and Joseph R. Cavallaro ECE Department, Rice University, 6100 Main St., Houston, TX 77005 Email: {by2, mbw2, wgh,

More information

Performance Evaluation and Improvement of Algorithmic Approaches for Packet Classification

Performance Evaluation and Improvement of Algorithmic Approaches for Packet Classification Performance Evaluation and Improvement of Algorithmic Approaches for Packet Classification Yaxuan Qi, Jun Li Research Institute of Information Technology (RIIT) Tsinghua University, Beijing, China, 100084

More information

EFFICIENT RECURSIVE IMPLEMENTATION OF A QUADRATIC PERMUTATION POLYNOMIAL INTERLEAVER FOR LONG TERM EVOLUTION SYSTEMS

EFFICIENT RECURSIVE IMPLEMENTATION OF A QUADRATIC PERMUTATION POLYNOMIAL INTERLEAVER FOR LONG TERM EVOLUTION SYSTEMS Rev. Roum. Sci. Techn. Électrotechn. et Énerg. Vol. 61, 1, pp. 53 57, Bucarest, 016 Électronique et transmission de l information EFFICIENT RECURSIVE IMPLEMENTATION OF A QUADRATIC PERMUTATION POLYNOMIAL

More information

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures Towards Performance Modeling of 3D Memory Integrated FPGA Architectures Shreyas G. Singapura, Anand Panangadan and Viktor K. Prasanna University of Southern California, Los Angeles CA 90089, USA, {singapur,

More information

FPGA Provides Speedy Data Compression for Hyperspectral Imagery

FPGA Provides Speedy Data Compression for Hyperspectral Imagery FPGA Provides Speedy Data Compression for Hyperspectral Imagery Engineers implement the Fast Lossless compression algorithm on a Virtex-5 FPGA; this implementation provides the ability to keep up with

More information

A Hardware Structure for FAST Protocol Decoding Adapting to 40Gbps Bandwidth Lei-Lei YU 1,a, Yu-Zhuo FU 2,b,* and Ting LIU 3,c

A Hardware Structure for FAST Protocol Decoding Adapting to 40Gbps Bandwidth Lei-Lei YU 1,a, Yu-Zhuo FU 2,b,* and Ting LIU 3,c 2017 3rd International Conference on Computer Science and Mechanical Automation (CSMA 2017) ISBN: 978-1-60595-506-3 A Hardware Structure for FAST Protocol Decoding Adapting to 40Gbps Bandwidth Lei-Lei

More information

Pipelined Parallel AC-based Approach for Multi-String Matching

Pipelined Parallel AC-based Approach for Multi-String Matching 2008 14th IEEE International Conference on Parallel and Distributed Systems Pipelined Parallel AC-based Approach for Multi-String Matching Wei Lin 1, 2, Bin Liu 1 1 Department of Computer Science and Technology,

More information

Design of a Near-Minimal Dynamic Perfect Hash Function on Embedded Device

Design of a Near-Minimal Dynamic Perfect Hash Function on Embedded Device Design of a Near-Minimal Dynamic Perfect Hash Function on Embedded Device Derek Pao, Xing Wang and Ziyan Lu Department of Electronic Engineering, City University of Hong Kong, HONG KONG E-mail: d.pao@cityu.edu.hk,

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI CHEN TIANZHOU SHI QINGSONG JIANG NING College of Computer Science Zhejiang University College of Computer Science

More information

CS415 Compilers. Lexical Analysis

CS415 Compilers. Lexical Analysis CS415 Compilers Lexical Analysis These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University Lecture 7 1 Announcements First project and second homework

More information

Improving Signature Matching using Binary Decision Diagrams

Improving Signature Matching using Binary Decision Diagrams Improving Signature Matching using Binary Decision Diagrams Liu Yang, Rezwana Karim, Vinod Ganapathy Rutgers University Randy Smith Sandia National Labs Signature matching in IDS Find instances of network

More information

Multi-core Implementation of Decomposition-based Packet Classification Algorithms 1

Multi-core Implementation of Decomposition-based Packet Classification Algorithms 1 Multi-core Implementation of Decomposition-based Packet Classification Algorithms 1 Shijie Zhou, Yun R. Qu, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering, University of Southern

More information

Novel FPGA-Based Signature Matching for Deep Packet Inspection

Novel FPGA-Based Signature Matching for Deep Packet Inspection Novel FPGA-Based Signature Matching for Deep Packet Inspection Nitesh B. Guinde and Sotirios G. Ziavras Electrical & Computer Engineering Department, New Jersey Institute of Technology, Newark NJ 07102,

More information

CS 432 Fall Mike Lam, Professor. Finite Automata Conversions and Lexing

CS 432 Fall Mike Lam, Professor. Finite Automata Conversions and Lexing CS 432 Fall 2017 Mike Lam, Professor Finite Automata Conversions and Lexing Finite Automata Key result: all of the following have the same expressive power (i.e., they all describe regular languages):

More information

Highly Space Efficient Counters for Perl Compatible Regular Expressions in FPGAs

Highly Space Efficient Counters for Perl Compatible Regular Expressions in FPGAs Highly Space Efficient Counters for Perl Compatible Regular Expressions in FPGAs Chia-Tien Dan Lo and Yi-Gang Tai Department of Computer Science University of Texas at San Antonio {danlo,ytai}@cs.utsa.edu

More information

Scalable Multi-Pipeline Architecture for High Performance Multi-Pattern String Matching

Scalable Multi-Pipeline Architecture for High Performance Multi-Pattern String Matching Scalable Multi-Pipeline Architecture for High Performance Multi-Pattern String Matching Weirong Jiang, Yi-Hua E. Yang and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of

More information

Hardware Description of Multi-Directional Fast Sobel Edge Detection Processor by VHDL for Implementing on FPGA

Hardware Description of Multi-Directional Fast Sobel Edge Detection Processor by VHDL for Implementing on FPGA Hardware Description of Multi-Directional Fast Sobel Edge Detection Processor by VHDL for Implementing on FPGA Arash Nosrat Faculty of Engineering Shahid Chamran University Ahvaz, Iran Yousef S. Kavian

More information

IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY FPGA

IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY FPGA IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY FPGA Implementations of Tiny Mersenne Twister Guoping Wang Department of Engineering, Indiana University Purdue University Fort

More information

OPTIMAL MULTI-CHANNEL ASSIGNMENTS IN VEHICULAR AD-HOC NETWORKS

OPTIMAL MULTI-CHANNEL ASSIGNMENTS IN VEHICULAR AD-HOC NETWORKS Chapter 2 OPTIMAL MULTI-CHANNEL ASSIGNMENTS IN VEHICULAR AD-HOC NETWORKS Hanan Luss and Wai Chen Telcordia Technologies, Piscataway, New Jersey 08854 hluss@telcordia.com, wchen@research.telcordia.com Abstract:

More information

Fast and Memory-Efficient Traffic Classification with Deep Packet Inspection in CMP Architecture

Fast and Memory-Efficient Traffic Classification with Deep Packet Inspection in CMP Architecture 2010 Fifth IEEE International Conference on Networking, Architecture, and Storage Fast and Memory-Efficient Traffic Classification with Deep Packet Inspection in CMP Architecture Tingwen Liu, Yong Sun

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Multi-pattern Signature Matching for Hardware Network Intrusion Detection Systems

Multi-pattern Signature Matching for Hardware Network Intrusion Detection Systems This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE GLOBECOM 5 proceedings. Multi-pattern Signature Matching for Hardware

More information

DEEP packet inspection, in which packet payloads are

DEEP packet inspection, in which packet payloads are 310 IEEE TRANSACTIONS ON COMPUTERS, VOL. 62, NO. 2, FEBRUARY 2013 Fast Deep Packet Inspection with a Dual Finite Automata Cong Liu and Jie Wu, Fellow, IEEE Abstract Deep packet inspection, in which packet

More information

Lexical Analysis - 2

Lexical Analysis - 2 Lexical Analysis - 2 More regular expressions Finite Automata NFAs and DFAs Scanners JLex - a scanner generator 1 Regular Expressions in JLex Symbol - Meaning. Matches a single character (not newline)

More information

High-Performance VLSI Architecture of H.264/AVC CAVLD by Parallel Run_before Estimation Algorithm *

High-Performance VLSI Architecture of H.264/AVC CAVLD by Parallel Run_before Estimation Algorithm * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 29, 595-605 (2013) High-Performance VLSI Architecture of H.264/AVC CAVLD by Parallel Run_before Estimation Algorithm * JONGWOO BAE 1 AND JINSOO CHO 2,+ 1

More information

Scalable Lookahead Regular Expression Detection System for Deep Packet Inspection

Scalable Lookahead Regular Expression Detection System for Deep Packet Inspection IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 20, NO. 3, JUNE 2012 699 Scalable Lookahead Regular Expression Detection System for Deep Packet Inspection Masanori Bando, Associate Member, IEEE, N. Sertac Artan,

More information

Deep Packet Inspection of Next Generation Network Devices

Deep Packet Inspection of Next Generation Network Devices Deep Packet Inspection of Next Generation Network Devices Prof. Anat Bremler-Barr IDC Herzliya, Israel www.deepness-lab.org This work was supported by European Research Council (ERC) Starting Grant no.

More information

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased platforms Damian Karwowski, Marek Domański Poznan University of Technology, Chair of Multimedia Telecommunications and Microelectronics

More information

CHAPTER 9 MULTIPLEXERS, DECODERS, AND PROGRAMMABLE LOGIC DEVICES

CHAPTER 9 MULTIPLEXERS, DECODERS, AND PROGRAMMABLE LOGIC DEVICES CHAPTER 9 MULTIPLEXERS, DECODERS, AND PROGRAMMABLE LOGIC DEVICES This chapter in the book includes: Objectives Study Guide 9.1 Introduction 9.2 Multiplexers 9.3 Three-State Buffers 9.4 Decoders and Encoders

More information

Efficient Assembly of Sparse Matrices Using Hashing

Efficient Assembly of Sparse Matrices Using Hashing Efficient Assembly of Sparse Matrices Using Hashing Mats Aspnäs, Artur Signell, and Jan Westerholm Åbo Akademi University, Faculty of Technology, Department of Information Technologies, Joukahainengatan

More information

Comparing and Contrasting different Approaches of Code Generator(Enum,Map-Like,If-else,Graph)

Comparing and Contrasting different Approaches of Code Generator(Enum,Map-Like,If-else,Graph) Comparing and Contrasting different Approaches of Generator(Enum,Map-Like,If-else,Graph) Vivek Tripathi 1 Sandeep kumar Gonnade 2 Mtech Scholar 1 Asst.Professor 2 Department of Computer Science & Engineering,

More information

Fault Diagnosis Schemes for Low-Energy BlockCipher Midori Benchmarked on FPGA

Fault Diagnosis Schemes for Low-Energy BlockCipher Midori Benchmarked on FPGA Fault Diagnosis Schemes for Low-Energy BlockCipher Midori Benchmarked on FPGA Abstract: Achieving secure high-performance implementations for constrained applications such as implantable and wearable medical

More information

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis CS 1622 Lecture 2 Lexical Analysis CS 1622 Lecture 2 1 Lecture 2 Review of last lecture and finish up overview The first compiler phase: lexical analysis Reading: Chapter 2 in text (by 1/18) CS 1622 Lecture

More information

Efficient Parallelization of Regular Expression Matching for Deep Inspection

Efficient Parallelization of Regular Expression Matching for Deep Inspection Efficient Parallelization of Regular Expression Matching for Deep Inspection Zhe Fu, Zhi Liu and Jun Li Department of Automation, Tsinghua University, China Research Institute of Information Technology,

More information

Design and Implementation of 3-D DWT for Video Processing Applications

Design and Implementation of 3-D DWT for Video Processing Applications Design and Implementation of 3-D DWT for Video Processing Applications P. Mohaniah 1, P. Sathyanarayana 2, A. S. Ram Kumar Reddy 3 & A. Vijayalakshmi 4 1 E.C.E, N.B.K.R.IST, Vidyanagar, 2 E.C.E, S.V University

More information

SSA: A Power and Memory Efficient Scheme to Multi-Match Packet Classification. Fang Yu, T.V. Lakshman, Martin Austin Motoyama, Randy H.

SSA: A Power and Memory Efficient Scheme to Multi-Match Packet Classification. Fang Yu, T.V. Lakshman, Martin Austin Motoyama, Randy H. SSA: A Power and Memory Efficient Scheme to Multi-Match Packet Classification Fang Yu, T.V. Lakshman, Martin Austin Motoyama, Randy H. Katz Presented by: Discussion led by: Sailesh Kumar Packet Classification

More information

Overview. Implementing Gigabit Routers with NetFPGA. Basic Architectural Components of an IP Router. Per-packet processing in an IP Router

Overview. Implementing Gigabit Routers with NetFPGA. Basic Architectural Components of an IP Router. Per-packet processing in an IP Router Overview Implementing Gigabit Routers with NetFPGA Prof. Sasu Tarkoma The NetFPGA is a low-cost platform for teaching networking hardware and router design, and a tool for networking researchers. The NetFPGA

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI, CHEN TIANZHOU, SHI QINGSONG, JIANG NING College of Computer Science Zhejiang University College of Computer

More information

Regular expression matching with input compression: a hardware design for use within network intrusion detection systems.

Regular expression matching with input compression: a hardware design for use within network intrusion detection systems. Regular expression matching with input compression: a hardware design for use within network intrusion detection systems. Gerald Tripp University of Kent About Author Gerald Tripp is a Lecturer in Computer

More information

Boundary Hash for Memory-Efficient Deep Packet Inspection

Boundary Hash for Memory-Efficient Deep Packet Inspection Boundary Hash for Memory-Efficient Deep Packet Inspection N. Sertac Artan, Masanori Bando, and H. Jonathan Chao Electrical and Computer Engineering Department Polytechnic University Brooklyn, NY Abstract

More information

StriD 2 FA: Scalable Regular Expression Matching for Deep Packet Inspection

StriD 2 FA: Scalable Regular Expression Matching for Deep Packet Inspection StriD FA: Scalale Regular Expression Matching for Deep Packet Inspection Xiaofei Wang Junchen Jiang Yi Tang Yi Wang Bin Liu Xiaojun Wang School of Electronic Engineering, Dulin City University, Dulin,

More information

Software Architecture for a Lightweight Payload Signature-Based Traffic Classification System

Software Architecture for a Lightweight Payload Signature-Based Traffic Classification System Software Architecture for a Lightweight Payload Signature-Based Traffic Classification System Jun-Sang Park, Sung-Ho Yoon, and Myung-Sup Kim Dept. of Computer and Information Science, Korea University,

More information

PERG-Rx: An FPGA-based Pattern-Matching Engine with Limited Regular Expression Support for Large Pattern Database. Johnny Ho

PERG-Rx: An FPGA-based Pattern-Matching Engine with Limited Regular Expression Support for Large Pattern Database. Johnny Ho PERG-Rx: An FPGA-based Pattern-Matching Engine with Limited Regular Expression Support for Large Pattern Database Johnny Ho Supervisor: Guy Lemieux Date: September 11, 2009 University of British Columbia

More information

Hardware Implementations of Finite Automata and Regular Expressions

Hardware Implementations of Finite Automata and Regular Expressions Hardware Implementations of Finite Automata and Regular Expressions Extended Abstract Bruce W. Watson (B) FASTAR Group, Department of Information Science, Stellenbosch University, Stellenbosch, South Africa

More information

Carry-Free Radix-2 Subtractive Division Algorithm and Implementation of the Divider

Carry-Free Radix-2 Subtractive Division Algorithm and Implementation of the Divider Tamkang Journal of Science and Engineering, Vol. 3, No., pp. 29-255 (2000) 29 Carry-Free Radix-2 Subtractive Division Algorithm and Implementation of the Divider Jen-Shiun Chiang, Hung-Da Chung and Min-Show

More information

Line-rate packet processing in hardware: the evolution towards 400 Gbit/s

Line-rate packet processing in hardware: the evolution towards 400 Gbit/s Proceedings of the 9 th International Conference on Applied Informatics Eger, Hungary, January 29 February 1, 2014. Vol. 1. pp. 259 268 doi: 10.14794/ICAI.9.2014.1.259 Line-rate packet processing in hardware:

More information

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Nikolai Zamarashkin and Dmitry Zheltkov INM RAS, Gubkina 8, Moscow, Russia {nikolai.zamarashkin,dmitry.zheltkov}@gmail.com

More information

Scanline-based rendering of 2D vector graphics

Scanline-based rendering of 2D vector graphics Scanline-based rendering of 2D vector graphics Sang-Woo Seo 1, Yong-Luo Shen 1,2, Kwan-Young Kim 3, and Hyeong-Cheol Oh 4a) 1 Dept. of Elec. & Info. Eng., Graduate School, Korea Univ., Seoul 136 701, Korea

More information

A MULTI-CHARACTER TRANSITION STRING MATCHING ARCHITECTURE BASED ON AHO-CORASICK ALGORITHM. Chien-Chi Chen and Sheng-De Wang

A MULTI-CHARACTER TRANSITION STRING MATCHING ARCHITECTURE BASED ON AHO-CORASICK ALGORITHM. Chien-Chi Chen and Sheng-De Wang International Journal of Innovative Computing, Information and Control ICIC International c 2012 ISSN 1349-4198 Volume 8, Number 12, December 2012 pp. 8367 8386 A MULTI-CHARACTER TRANSITION STRING MATCHING

More information

An FPGA Implementation of the Powering Function with Single Precision Floating-Point Arithmetic

An FPGA Implementation of the Powering Function with Single Precision Floating-Point Arithmetic An FPGA Implementation of the Powering Function with Single Precision Floating-Point Arithmetic Pedro Echeverría, Marisa López-Vallejo Department of Electronic Engineering, Universidad Politécnica de Madrid

More information

Performance Modeling of Pipelined Linear Algebra Architectures on FPGAs

Performance Modeling of Pipelined Linear Algebra Architectures on FPGAs Performance Modeling of Pipelined Linear Algebra Architectures on FPGAs Sam Skalicky, Sonia López, Marcin Łukowiak, James Letendre, and Matthew Ryan Rochester Institute of Technology, Rochester NY 14623,

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

How to Apply the Geospatial Data Abstraction Library (GDAL) Properly to Parallel Geospatial Raster I/O?

How to Apply the Geospatial Data Abstraction Library (GDAL) Properly to Parallel Geospatial Raster I/O? bs_bs_banner Short Technical Note Transactions in GIS, 2014, 18(6): 950 957 How to Apply the Geospatial Data Abstraction Library (GDAL) Properly to Parallel Geospatial Raster I/O? Cheng-Zhi Qin,* Li-Jun

More information

INTEGER SEQUENCE WINDOW BASED RECONFIGURABLE FIR FILTERS.

INTEGER SEQUENCE WINDOW BASED RECONFIGURABLE FIR FILTERS. INTEGER SEQUENCE WINDOW BASED RECONFIGURABLE FIR FILTERS Arulalan Rajan 1, H S Jamadagni 1, Ashok Rao 2 1 Centre for Electronics Design and Technology, Indian Institute of Science, India (mrarul,hsjam)@cedt.iisc.ernet.in

More information