Large-scale Multi-flow Regular Expression Matching on FPGA*


2012 IEEE 13th International Conference on High Performance Switching and Routing

Yun Qu, Ming Hsieh Dept. of Electrical Eng., University of Southern California
Yi-Hua E. Yang, Network Division, Huawei North America
Viktor K. Prasanna, Ming Hsieh Dept. of Electrical Eng., University of Southern California

Abstract

High-throughput regular expression matching (REM) over a single packet flow for deep packet inspection in routers has been well studied. In many real-world cases, however, the packet processing operations are performed on a large number of packet flows, each supported by many run-time states. To handle a large number of flows, the architecture should support a mechanism to perform rapid context switches without adversely affecting the throughput. As the number of flows increases, large-capacity memory is needed to store the per-flow states of the matching. In this paper, we propose a hardware-accelerated context switch mechanism for managing a large number of states in memory efficiently. With sufficiently large off-chip memory, a state-of-the-art FPGA device can be multiplexed by millions of packet flows with negligible throughput degradation for large packets. Post-place-and-route results show that when 8 characters are matched per cycle, our design can achieve a 180 MHz clock rate, leading to a throughput of 11.8 Gbps.

Index Terms: Deep packet inspection, packet flow, context switch, FPGA, off-chip memory

I. INTRODUCTION

High-speed packet processing with a large amount of state information is becoming an essential function of network routers. For example, deep packet inspection (DPI) utilizing regular expression matching (REM) [1, 2] has been used for detecting malicious patterns in packet flows (see Section II-B). Most packet processing tasks, such as DPI using REM, require keeping an increasingly large amount of state per packet flow [3, 4, 5, 6].
Specifically, packet processing engines keep track of the current states and generate outputs based on the saved states and the input, so the states must be recorded at run time. Meanwhile, a major concern has been the rapidly growing number of packet flows and the increasing network bandwidth. Aggregate internet traffic experienced an annual bandwidth growth of 40%~50% from 2002 to 2010 [7], and the number of concurrent packet flows in backbone routers has grown into the millions. As a consequence, an efficient mechanism is needed both to process packets at high throughput and to multiplex the packet processing engine among a large number of packet flows.

In most cases, FPGA-based REM solutions [6, 8] only address the problem of matching a set of regular expressions (regexes) against a single packet flow. The traffic on a high-speed network link, on the other hand, usually consists of thousands of packet flows at any time. In order to multiplex an existing single-flow REM solution [6, 8] among multiple packet flows, the REM system must have high-bandwidth access to the state context (see Section III) of every packet flow at run time [9]. The number of packet flows supported by the REM system is thus restricted by the size of the on-chip memory used for context storage. In general, high-bandwidth on-chip memory (e.g., distributed RAM on FPGA), as used in [9], has limited size and cannot hold the state context for more than a few hundred packet flows. We therefore need to use off-chip memory to store a large amount of run-time state.

This paper focuses on the design of a highly efficient context switch mechanism for multiplexing a high-throughput packet processing engine among multiple packet flows.

* This work is supported by the U.S. National Science Foundation under grant CCR-11881; an equipment grant from Xilinx Inc. is gratefully acknowledged.
Using a high-performance REM solution [6] on FPGA as the example packet processing engine, our design allows the original single-flow REM solution to be multiplexed with a single-cycle context-switch overhead. Specifically, our main contributions are as follows:

- We propose a design that lets the REM circuit utilize off-chip context memory.
- We propose a deeply-overlapped schedule to manage the context and reduce the switching overhead on the REM throughput.
- We give a detailed implementation and performance evaluation to demonstrate high throughput.

The paper is organized as follows. Section II introduces the background of our problem as well as prior work. Section III describes in detail the proposed architecture and the context management schedule. The performance is evaluated in Section IV. Finally, Section V concludes the paper.

II. BACKGROUND AND PRIOR WORK

A. Regular Expression Matching (REM)

A regular expression (regex) defines a regular language over a fixed alphabet. Given a regex r and a sequence of input characters s = [x_0, x_1, ...], regular expression matching (REM) of r against s is the process of finding and reporting any substring of s which is a member of L(r), the regular language defined by r. In general, REM can be performed with a set of regexes {r_0, r_1, ..., r_{m-1}}, where all regexes are matched against the input packet flow concurrently. A typical construction for the hardware-based regular expression matching engine (REME) uses a non-deterministic finite automaton (NFA), utilizing the massively parallel and reconfigurable logic resources on FPGA to achieve high throughput [3, 5, 10, 11, 12].

B. Multi-flow REM problem

A packet flow is a sequence of packets sent from a particular source to a particular destination. Since the same network link is usually traversed by thousands of packet flows, multi-flow REM needs to be performed at the router. In multi-flow REM, all the regexes are matched against multiple input packet flows coming from the network interface. Although the input to the entire REM system consists of k interleaved packet flows {s_0, s_1, ..., s_{k-1}}, all m regexes {r_0, r_1, ..., r_{m-1}} are matched against each packet flow individually. Any match output associated with a specific regex is identified by the number (between 0 and k - 1) of the input packet flow in which the match occurs.

However, large numbers of packet flows and regexes may consume a lot of on-chip resources and require high-bandwidth access to memory to store and retrieve a large number of states online. As a consequence, dynamically updating all states at run time remains a challenge.

C. Prior work

1) Single-flow REM: From the hardware point of view, the implementation of NFA-based REM was first studied by Floyd and Ullman in [13].
Later in [11], an algorithm was proposed to translate an arbitrary regular expression directly into its matching circuit on FPGA. [5] proposed a tree structure where character inputs are pipelined and broadcast. Automatic REM circuit construction in VHDL was proposed in [3] and [10], where the NFA structure at the circuit level is later used by most other implementations [3, 4, 5, 10]. Several techniques were proposed to improve the circuit and enhance the performance. Among them, [12] proposed an algorithm to construct a multi-stride NFA. [3] proposed an approach which uses shift-register lookup tables (SRL) to implement single-character repetitions. However, all of these NFA-based REM approaches require a large amount of state to be stored.

[6] proposed an efficient algorithm to construct the NFA structure for single-flow REM. The resulting NFA circuit was mapped into several modules, and multiple modules of the same structure were stacked together to match multiple characters per clock cycle. The resulting circuit structure was a 2-dimensional pipeline, with the character input propagating along different pipelines horizontally and along multiple stages vertically. The total number of state bits can be very large.

Figure 1: Overall Architecture

2) Multi-flow REM: The first paper on multi-flow REM appeared in [10], where a parallel architecture of matching engines for regular patterns was proposed. The system matches each input packet flow individually, using an independent NFA-based matching subsystem per flow. This approach is commonly referred to as the individual solution to the multi-flow REM problem. However, since each packet flow gets its own dedicated resources, the solution does not scale well with the number of packet flows.

As described in [9], the multiplexing solution is another option for the multi-flow REM problem, where multiple packet flows share a single REM system composed of several REME.
In this approach, multiple packet flows are first time-multiplexed outside of the NFA-based REM circuit. A multiplexer selects a single packet flow as the input to the REM circuit at any time, and switches to another packet flow after the current states have been recorded. Since multiple packet flows share the resources, the scalability issue is mitigated.

In [9], the context memory was instantiated using the on-chip distributed RAM of the FPGA because it offers the highest bandwidth among the available memory types. However, as the number of packet flows or regexes grows, the memory size can no longer support that many packet flows and regexes. In real-life scenarios too many packet flows (>>1) and too many regexes (>1, each with ~1 state bits) need to be dealt with, while the maximum on-chip distributed RAM size is below 9Mb, making it impossible to store all contexts on-chip. This shortcoming of the design in [9] is the motivation for this paper.

III. CONTEXT SWITCH USING OFF-CHIP MEMORY

The multiplexing of the REM circuit is facilitated by the context switching operation. A context, possibly consisting of millions of bits, represents the states of the REME corresponding to a particular packet flow at a specific byte offset.

A. Overall Architecture

Unlike in [9], where the high-bandwidth on-chip distributed RAM is used for context memory, in this work we focus on designing the context switch mechanism using high-capacity off-chip memory. This allows us to multiplex the high-performance REM circuit among a larger number of packet flows.

The overall architecture is shown in Figure 1, where we arrange all the REME in a 2-dimensional array. We define the number of pipelines as u and the number of stages per pipeline as v, so the total number of stages is (u × v), where each stage can consist of n REME. Each stage in Figure 1 is labeled uniquely with a pair of numbers (pipeline index, stage index). When a specific packet flow is selected by the off-chip control circuit, the character input is propagated along pipelines and stages in a pipelined fashion as in [6].

To match against multiple packet flows concurrently, the original architecture of the stage proposed in [6] has to be modified. Each stage in Figure 1 has a state buffer attached to support the context switch mechanism. The organization of each stage is shown in Figure 2, where the state buffer is locally connected to the adjacent stages as discussed in Section III-B1.

Figure 2: Stage organization

B. Context Switch Mechanism

1) Context access: The context access order is scheduled in a snake-like linear array as shown in Figure 3, where the contexts propagate in a pipelined manner in the direction the arrows indicate. With an even number of pipelines implemented, only the first stage (0, 0) and the last stage (u - 1, 0) are directly connected to the off-chip memory. The context access datapath forms a ring structure, which is different from the datapath of the input characters.
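The snake-like path can be made concrete with a small software sketch. It is only an illustration, not the authors' circuit: it assumes the path starts at stage (0, 0), runs up pipeline 0 to stage (0, v - 1), crosses to pipeline 1 and runs back down, and so on, so that with an even u it ends at stage (u - 1, 0). The function name `snake_path` is hypothetical.

```python
def snake_path(u, v):
    """Return the (pipeline, stage) pairs visited along the snake-like array.

    Assumption for illustration: even-indexed pipelines are traversed with
    increasing stage index and odd-indexed pipelines with decreasing stage
    index, so that consecutive entries are always physically adjacent.
    """
    if u <= 0 or v <= 0:
        raise ValueError("u and v must be positive")
    if u % 2 != 0:
        raise ValueError("an even number of pipelines is assumed")
    path = []
    for i in range(u):
        stages = range(v) if i % 2 == 0 else range(v - 1, -1, -1)
        path.extend((i, j) for j in stages)
    return path


if __name__ == "__main__":
    path = snake_path(4, 3)
    print(path[0], path[-1])   # the two stages attached to off-chip memory
    print(len(path))           # total number of stages, u * v
```

In this model the first and last entries of the path are the only stages that touch the off-chip memory, and closing the path through the memory gives the ring structure mentioned above.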
The proposed context access order has the following properties:

- The contexts of all stages are loaded from (or offloaded to) the off-chip memory in the reverse order of the snake-like arrows, i.e., stage (u - 1, 0), stage (u - 1, 1), ..., stage (u - 1, v - 1), stage (u - 2, v - 1), stage (u - 2, v - 2), ..., stage (0, 0).
- During load time, the context of each stage is first loaded into the state buffer at stage (0, 0), then shifted through the state buffers of the stages following the snake-like arrows in Figure 3a until it reaches its destination. It takes (u × v) cycles to load the context of stage (u - 1, 0) into its state buffer, and the total load time of all stages is (u × v) cycles.
- During offload time, the context of each stage is first shifted through the state buffers following the snake-like arrows in Figure 3b until it reaches the state buffer at stage (u - 1, 0), and is then offloaded to the off-chip memory. It takes (u × v) cycles to offload the context of stage (0, 0) into the off-chip memory, and the total offload time of all stages is (u × v) cycles.

Figure 3: Context access schedule. (a) Loading context. (b) Offloading context.

2) Context switch: After all stages have received their contexts from the off-chip context memory, in each stage (Figure 2) the next context saved in the state buffer and the current context recorded in the state registers can be swapped in a single clock cycle. Across stages, an efficient way to switch contexts is to pipeline the switching control signals along with the character input in the 2-dimensional architecture, resulting in a diagonal, wave-like propagation of the context switch as shown in Figure 4.
Specifically, in the first cycle after the completion of context access, only stage (0, 0) switches to the next packet flow while halting the REM (otherwise the context in the state registers would be destroyed), and all other stages continue the REM for the current packet flow. In the second cycle, both stage (0, 1) and stage (1, 0) switch while halting the REM, and stage (0, 0) starts its REM for the next packet flow. The switching process propagates along the diagonal of the 2-dimensional array until stage (u - 1, v - 1) is switched in the (u + v - 1)th cycle.

Figure 4: Context switch order

The proposed context switch order has the following properties:

- A context switch occurs after loading the next context, and before storing the current context.
- When switching context from flow i to flow i + 1, all stages have to switch after matching to the same character offset in flow i.
- The contexts of all stages can be switched in the same order as the propagation of the input character, as shown in Figure 4.

3) Context update schedule: Since the bandwidth between the on-chip circuit and the off-chip memory is usually large enough for multiple REME to load and offload contexts, a stage can contain multiple REME (n) so that each stage reads and writes its context in a single clock cycle. An example of the context update schedule for three packet flows and the entire REM circuit is shown in Figure 5. We have the following observations:

- The load and offload time can be overlapped, assuming the off-chip memory supports concurrent read-and-write access. The context load or offload time for the whole REM circuit is (u × v) cycles due to the snake-like linear array in Figure 3, resulting in a minimum matching time of q = (u × v + u + v - 2) cycles during which the REM should not be interrupted by switching.
- Alternating reads and writes can be slow for some types of memory. If load and offload time cannot be overlapped, then there must be (2 × u × v) cycles between context switches, resulting in a minimum matching time of (2 × u × v + u + v - 2) cycles.
- The switching lasts for (u + v - 1) cycles. The method described in Section III-B2 leads to the stepped slopes during context switch shown in Figure 5.
- The context switch in each stage takes only 1 cycle away from the REM process, resulting in a single-cycle context-switch overhead.

IV. PERFORMANCE EVALUATION

A. Experimental setup

We conducted experiments using Xilinx ISE 13.1 targeting a Virtex 6 HX-565T FPGA chip.
The on-chip circuit can be configured as 8 pipelines, 8 stages per pipeline and 4 REME(1) per stage to improve area efficiency. Multiple on-chip circuits were constructed for different numbers of regexes, matching single or multiple characters per cycle.

(1) Each REME corresponds to a single regex.

Figure 5: Context update schedule

To simplify the experiments, we only consider DDRII SRAM as the off-chip context memory. In practice, the proposed context switch scheme can be applied to systems with various types of off-chip memory. We used 6 parallel SRAM modules (each 1M × 36 bits, 300 MHz DDRII, dual-port access) as the context memory. Since the off-chip memory access bandwidth is limited, the maximum total number of regex states that can be accessed for a single stage is bounded. With 4 REME per stage, the memory access bandwidth can support 1 stage reading and 1 stage writing concurrently, each of up to 432 bits per cycle.

The control circuit and the multiplexer outside the REM circuit were excluded from the on-chip design. The design was synthesized (xst) and placed and routed (par) for the targeted device with either the maximize speed or the minimize area option. Post-place-and-route results are reported.

We used a fixed set of regexes extracted from Snort rules (published in February 2010) [1]. Regexes with a large number of states (> 108 states per regex) were excluded. Note that our REME design methodology as well as the proposed architecture can handle larger numbers of states. For a fair evaluation of the proposed multi-flow REM, regexes that are too short (< 10 states) were also omitted. Our implementation prototype consists of a set of 256 regexes with on average 72 states per regex, while the longest regex we instantiated consists of 96 states.

B. Evaluation results

The access bandwidth and the size of the off-chip memory are fixed in the experiments. The total memory size in the experiments is able to hold different contexts for 1 million concurrent packet flows.
Using more copies of the off-chip SRAM modules can support an even larger number (over millions) of packet flows. Assuming there are sufficient off-chip resources, we conducted experiments mainly on the following parameters:

- Total number of regexes (n × u × v)
- Number of input characters per cycle (m)
- Context switch period (p) (see Section IV-B3)

For each parameter, we analyze its influence on the following metrics:

- Clock rate and throughput
- On-chip resource usage
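As a worked illustration of how the array parameters feed into the schedule of Section III-B3, the following sketch evaluates the cycle counts stated there for a given configuration. It is a plain arithmetic model, not part of the implementation; the function name `schedule_cycles` and the dictionary keys are hypothetical.

```python
def schedule_cycles(u, v, n, overlapped=True):
    """Cycle counts of the context schedule for a u-by-v array with n REME per stage.

    overlapped=True assumes the off-chip memory supports concurrent
    read-and-write access, so load and offload share the same u*v cycles.
    """
    access = u * v                      # load (or offload) time of all stages
    switch_wave = u + v - 1             # cycles for the diagonal switch to cross the array
    between_switches = access if overlapped else 2 * access
    min_matching = between_switches + u + v - 2
    return {
        "total_reme": n * u * v,
        "access_cycles": access,
        "switch_wave_cycles": switch_wave,
        "min_matching_cycles": min_matching,
    }


if __name__ == "__main__":
    # 8 pipelines, 8 stages per pipeline, 4 REME per stage, as in the experimental setup
    print(schedule_cycles(8, 8, 4))
    print(schedule_cycles(8, 8, 4, overlapped=False))
```

For the 8-by-8 array with 4 REME per stage this gives 256 REME, a load or offload time of 64 cycles, a switching wave of 15 cycles, and a minimum matching time of q = 78 cycles (142 cycles if load and offload cannot be overlapped).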

Figure 6: Throughput vs. number of REME (1-character input). Axes: throughput (Gbps) and minimum matching period (cycles) against number of REME.

Figure 7: Throughput vs. no. of input characters (256 REME). Axes: clock rate (MHz) and throughput (Gbps) against number of input characters per cycle.

Figure 8: Resource consumption. (a) Number of REME. (b) Number of input characters (256 REME). Axes: number of occupied slices and number of used I/O pins.

We present the experimental results in Figure 6, Figure 7 and Figure 8.

1) Number of regexes: The total number of regexes in our design can be factored into three variables:

- Number of pipelines (u)
- Number of stages per pipeline (v)
- Number of REME per stage (n)

We constructed several circuits to examine the effect of varying the number of regexes. Each circuit was organized with a fixed number (n = 4) of REME per stage, since the number of REME per stage depends on the context access bandwidth. Because varying the number of pipelines is similar to varying the number of stages for the 2-dimensional array in Figure 1, we configured our circuits to have u ≤ 8 pipelines, each pipeline having a fixed number (v = 8) of stages. The experimental results obtained by varying only the number of pipelines (u) are presented in Figure 6.

a) Clock rate and throughput: The clock rate and the throughput have a nearly linear relationship, apart from a small negligible term (the context switch overhead), so the clock rate curve is omitted from Figure 6. The throughput (blue curve in Figure 6) degrades only modestly as the number of regexes grows; our design can achieve a throughput of 2.17 Gbps for 256 REME. For the 256 REME circuit (8-by-8 array), the minimize area option rather than the maximize speed option of the targeted device was used for the place-and-route (par) results.
Therefore we expect the clock rate to drop when optimizing for area efficiency, as shown in Figure 6. As the total number of regexes increases, the minimum matching time also increases as shown in Figure 6, which will be discussed later.

b) On-chip resource usage: The number of occupied slices is measured, since it indicates the area used on the FPGA. We also measure the number of used I/O pins, because it may become a constraint when only a limited number of I/O pins is available on-chip. As shown in Figure 8a, we notice:

- The number of occupied slices and used I/O pins increases linearly with the number of regexes, because more pipelines have to be implemented to accommodate more REME.
- Since each pipeline has the same number of stages and REME, the resource (slices and I/O pins) consumption has a linear relationship with the number of REME.
- I/O pins become the most heavily consumed resource, since a large number of I/O pins have to be used as the memory interface to the off-chip context memory.

2) Number of input characters per cycle: To enhance the throughput, we implemented the multi-character matching mechanism in our design using the method proposed in [6]. Specifically, a single-character REME takes a single character (8 bits) as input per cycle; by stacking copies of the same REME together and removing redundant registers, a multi-character REME can be constructed, where multiple characters are taken as input to the on-chip circuit per clock cycle. The resulting circuit can have longer routing paths, which affects the clock rate negatively. However, since the m-character REME can match more characters (8 × m bits) per cycle, a higher throughput can be achieved.

Table I: Multi-flow vs. single-flow REM (8-character, 256 REME)
Columns: REM | Clock (MHz) | Throughput (Gbps) | Min. match. time (cycles) | Occupied slices | I/O pins
Single flow: min. match. time: any; occupied slices: (16%); I/O pins: (14%)
Multi flow: occupied slices: (16%); I/O pins: (37%)

a) Clock rate and throughput: The green and orange lines in Figure 7 indicate, respectively, the achievable clock rate and throughput of our multi-flow REM design with respect to the number of input characters. We also implemented the single-flow multi-character REM, utilizing the same 2-dimensional structure as shown in Figure 1. We have the following two observations:

- As shown in Figure 7, when more characters are matched per cycle, the clock rate of multi-flow REM decreases due to longer routing paths in each REME; however, the throughput increases due to the multi-character input.
- As shown in Table I, the clock rate of our design differs only slightly from that of the single-flow multi-character REM (by <4%), yielding effectively the same throughput.

b) On-chip resource usage: The on-chip resources consumed by the 256 REME circuit for multi-flow REM are listed in Table I, compared with the resource usage of the single-flow REM system. As shown in Table I, slightly more on-chip resources are needed to implement the multi-flow REM, and a large share of the I/O pins is used. As shown in Figure 8b, to match m characters per cycle, we need to stack m copies of the same REME together, resulting in a sublinear increase of resource consumption with respect to the number of input characters.

3) Context switch period: The context switch period indicates the period of the context switches in our design.
a) Clock rate and throughput: As discussed in Section III-B3, the minimum matching time between two context switches in a particular stage is q cycles. During this period of time, continuous data from a packet flow should be fed into the on-chip circuit. A small value of the minimum matching time is desirable because:

- If p ≥ q + 1, then our REM system can achieve the maximum matching throughput. With sufficiently large p, we only have negligible throughput degradation.
- If p < q + 1, then the REM system achieves lower throughput due to the idle cycles during which the system has to wait for the context load-offload process to complete.

In general, if we denote the single-flow throughput as T_0 and the multi-flow throughput as T, then we have

\[
T =
\begin{cases}
\dfrac{p-1}{p}\, T_0, & \text{if } p \ge q + 1 \\[2ex]
\dfrac{p-1}{q+1}\, T_0, & \text{if } p < q + 1
\end{cases}
\qquad \text{where } q = u \times v + u + v - 2 .
\]

b) On-chip resource usage: The designed context switch period has no influence on the resource consumption.

V. CONCLUSION

In this paper, we studied the multiplexing solution to the REM problem for thousands of concurrent packet flows. With an off-chip context memory and an extension to the single-flow REM architecture, we developed an implementation of multi-flow REM that supports a large number of packet flows and a large number of regexes on FPGA. The same approach can be used in any packet processing for multiple network flows whenever a large number of state bits are involved.

REFERENCES

[1] SNORT.
[2] Bro Intrusion Detection System. [Online].
[3] J. Bispo, I. Sourdis, J. M. P. Cardoso, and S. Vassiliadis, "Regular expression matching for reconfigurable packet inspection," in Proc. IEEE Intl. Conf. on Field Programmable Technology (FPT), December 2006.
[4] C. Clark and D. Schimmel, "Scalable pattern matching for high speed networks," in Proc. IEEE Sym. on Field-Programmable Custom Computing Machines (FCCM), April 2004.
[5] B. L. Hutchings, R. Franklin, and D. Carver, "Assisting Network Intrusion Detection with Reconfigurable Hardware," in Proc. IEEE Sym. on Field-Programmable Custom Computing Machines (FCCM), 2002.
[6] Y.-H. E. Yang, W. Jiang, and V. K. Prasanna, "Compact Architecture for High-Throughput Regular Expression Matching on FPGA," in Proc. ACM/IEEE Sym. on Architectures for Networking and Communications Systems (ANCS), November 2008.
[7] Minnesota Internet Traffic Studies (MINTS).
[8] Y.-H. E. Yang and V. K. Prasanna, "Automatic Circuit Construction for Large-Scale Regular Expression Matching on FPGA," in Proc. Intl. Conf. on ReConFigurable Computing and FPGAs, 2008.
[9] Y. Qu, Y.-H. E. Yang, and V. K. Prasanna, "Multi-stream Regular Expression Matching on FPGA," in Proc. Intl. Conf. on ReConFigurable Computing and FPGAs, December 2011.
[10] A. Mitra, W. Najjar, and L. Bhuyan, "Compiling PCRE to FPGA for accelerating SNORT IDS," in Proc. ACM/IEEE Sym. on Architecture for Networking and Communications Systems (ANCS), New York, NY, USA, 2007.
[11] R. Sidhu and V. Prasanna, "Fast Regular Expression Matching Using FPGAs," in Proc. IEEE Sym. on Field-Programmable Custom Computing Machines (FCCM), 2001.
[12] N. Yamagaki, R. Sidhu, and S. Kamiya, "High-Speed Regular Expression Matching Engine Using Multi-Character NFA," in Proc. Intl. Conf. on Field Programmable Logic and Applications (FPL), Aug. 2008.
[13] R. W. Floyd and J. D. Ullman, "The Compilation of Regular Expressions into Integrated Circuits," Journal of the ACM, vol. 29, no. 3, 1982.
