Demystifying Automata Processing: GPUs, FPGAs or Micron's AP?


Demystifying Automata Processing: GPUs, FPGAs or Micron's AP?

Marziyeh Nourian 1,3, Xiang Wang 1, Xiaodong Yu 2, Wu-chun Feng 2, Michela Becchi 1,3
1,3 Department of Electrical and Computer Engineering, 2 Department of Computer Science
1 University of Missouri, 2 Virginia Tech, 3 North Carolina State University
mnouria@ncsu.edu, xw7b4@mail.missouri.edu, xyu@vt.edu, feng@cs.vt.edu, mbecchi@ncsu.edu

ABSTRACT

Many established and emerging applications perform at their core some form of pattern matching, a computation that maps naturally onto finite automata abstractions. As a consequence, in recent years there has been a substantial amount of work on high-speed automata processing, which has led to a number of implementations targeting a variety of parallel platforms: CPUs, GPUs, FPGAs, ASICs, and Network Processors. More recently, Micron has announced its Automata Processor (AP), a DRAM-based accelerator of non-deterministic finite automata (NFA). Despite the abundance of work in this domain, the advantages and disadvantages of different automata processing accelerators and the innovation space in this area are still unclear. In this work we target this problem and propose a toolchain to allow an apples-to-apples comparison of NFA acceleration engines on three platforms: GPUs, FPGAs and Micron's AP. We discuss the automata optimizations that are applicable to these three platforms. We perform an evaluation on large-scale datasets: to this end, we propose an NFA partitioning algorithm that minimizes the number of state replications required to maintain functional equivalence with an unpartitioned NFA, and we evaluate the scalability of each implementation to both large NFAs and large numbers of input streams. Our experimental evaluation covers resource utilization, traversal throughput, and preprocessing overhead and shows that the FPGA provides the best traversal throughputs (on the order of Gbps) at the cost of significant preprocessing times (on the order of hours); GPUs deliver modest traversal throughputs (on the order of Mbps), but offer low preprocessing times (on the order of seconds or minutes) and good pattern densities (they can accommodate large datasets on a single device); Micron's AP delivers throughputs, pattern densities, and preprocessing times that are intermediate between those of FPGAs and GPUs, and it is most suited for applications that use datasets consisting of many small NFAs with a topology that is fixed and known a priori.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. ICS '17, June 14-16, 2017, Chicago, IL, USA. © 2017 Association for Computing Machinery.

1 INTRODUCTION

Many established and emerging applications perform at their core some form of pattern matching, a computation that maps naturally onto finite automata abstractions. In biology, for example, several genomics tasks, such as motif discovery, orthology inference, shotgun and de novo assembly, involve string-matching operations on genomics data.
In turn, advances in DNA sequencing technology have led to increasingly large volumes of data available for these applications, resulting in a significant increase in their computational requirements. In the networking domain, several applications such as network intrusion detection, content-based routing, and application-level filtering require inspecting network packets for potentially large sets of predefined patterns, and they typically must perform this operation at the rate of packet arrival on the router interface. Given the number and relevance of applications requiring efficient pattern matching, there has been a substantial amount of work on high-speed automata processing, and this work has originated from different communities: from networking to reconfigurable computing and computer architecture to parallel computing. These efforts have led to a number of algorithmic [1-9] and architectural solutions targeting different parallel platforms: from CPUs to GPUs [10-12] to FPGAs [13-16] to ASICs [17-19] to Network Processors [20]. More recently, Micron has announced their Automata Processor [21], a DRAM-based accelerator of non-deterministic finite automata (NFA) that has been showcased on a variety of applications: motif discovery in biological sequences [22], association rule mining [23], Brill tagging [24], high-speed regular expression matching for network intrusion detection [25], graph processing [26], and sequential pattern mining [27]. Despite this abundance of work on high-speed automata processing, there is still a lack of clarity as to how existing software and hardware solutions relate to and compare with each other. There are several reasons for this. First, existing solutions are based on different automata models: either non-deterministic or deterministic finite automata (NFAs and DFAs, respectively). While functionally equivalent, NFAs and DFAs have practical differences in terms of resource requirements and traversal behavior that are strongly dependent on the characteristics of the underlying pattern set. While there has been a substantial body of work proposing automata designs that trade off the advantages and disadvantages of NFAs and DFAs [1-9], no automata model is preferable on all datasets. This makes it hard to provide a fair comparison between automata processors relying on different automata models. Second, some automata processing architectures are designed to optimize the peak performance of a single input stream, while others offer better support for stream-level concurrency. Third, applications relying on finite automata must operate

in two steps: in the preprocessing step, the required automaton must be generated, optimized, compiled, and loaded onto the target accelerator (through memory configuration and/or place&route operations); in the traversal step, the application performs pattern matching by traversing the automaton guided by the content of the input text. Most of the existing automata processing engines have been designed to optimize automata traversal, often at the cost of a significant preprocessing cost. While the preprocessing time is unimportant for some categories of applications (for example, network intrusion detection systems can operate for days or weeks between reconfigurations of their pattern sets), its effect on performance can be significant for other applications with more dynamic pattern sets or traversal times on the order of a few seconds. Unfortunately, the majority of the previous studies on Micron's AP neglect to report the preprocessing overhead (or part of it) [21-24] or report substantial speedups over preexisting CPU tools (not necessarily based on automata) by comparing the full execution time of these tools to only the traversal time of the automata-based solution (on the order of seconds or milliseconds), omitting the preprocessing time of the AP design (on the order of minutes) in the speedup calculation [25]. This can lead to results that are misleading or of limited practical use. To target these problems and provide an apples-to-apples comparison, we select automata accelerator designs that rely on the same automata model: NFAs. Since NFAs do not suffer from state explosion, their use allows us to perform an evaluation on large-scale datasets without posing any restrictions on the kind of patterns supported. Specifically, we compare GPU- and FPGA-based NFA engines with Micron's AP. Micron's AP extends NFA functionality with counters and boolean elements. To ensure functional equivalence and the same degree of programmability across the considered platforms, we extend existing FPGA- and GPU-based designs to support these features, and we adopt the same programming interface for all platforms: namely, Micron's Automata Network Markup Language (ANML). Different platforms offer different automata density; to take this into account, we perform an analysis on non-trivial dataset sizes, which requires partitioning large NFAs across multiple devices. Besides considering peak performance on a single input stream, we evaluate the scalability of the considered automata processor designs to multiple concurrent inputs. Finally, we evaluate the costs of the different preprocessing steps required by the considered architectures, and we study how the size of the automaton and the density of its transitions affect some of the preprocessing stages (e.g., place&route on Micron's AP and FPGA).

Figure 1: (a) NFA and (b) DFA accepting regular expressions a+bc, bc+ and cde. Accepting states are bold. States active after processing text aabc are colored gray.

To summarize, we make the following contributions:
- We extend existing FPGA- and GPU-based automata processing designs to support Micron's AP counters and boolean elements, and we propose a compiler toolchain to automatically deploy extended NFAs (in ANML form) onto these three platforms.
- We propose an NFA partitioning scheme aimed at minimizing the amount of state replication required to handle large NFAs while preserving functional equivalence with a single unpartitioned NFA.
- For GPU deployment, we explore different state layouts and kernels suited to NFAs with varying characteristics.
- We perform an apples-to-apples comparison between Micron's AP and GPU- and FPGA-based NFA accelerator designs on large-scale datasets. Our evaluation covers resource utilization, throughput and preprocessing costs for real-world NFAs used in networking and bioinformatics applications, as well as synthetic datasets covering regular expressions with various characteristics.

2 BACKGROUND AND RELATED WORK

2.1 Background on Automata Processing

Regular expression matching has traditionally been implemented by representing the pattern set through finite automata (FA) [28]. The matching operation is equivalent to an FA traversal guided by the content of the input stream. Worst-case performance guarantees can be offered by bounding the amount of processing performed per input character. However, techniques to keep per-character processing low involve increasing the size of the finite automaton, the basic data structure in the regular expression matching engine. As the size of pattern sets and the expressiveness of individual patterns increase, limiting the size of the automaton to fit on reasonably provisioned hardware platforms becomes challenging. Thus, the exploration space is characterized by a trade-off between the size of the automaton and the worst-case bound on the amount of per-character processing. NFAs and DFAs are at the two extremes of this exploration space. NFAs have a limited size but can require expensive per-character processing, whereas DFAs offer limited per-character processing at the cost of a possibly large automaton. In Figure 1 we show the NFA and DFA accepting three simple patterns (a+bc, bc+ and cde). In the figure, states active after processing text aabc are colored gray. In the NFA, the number of states and transitions is limited by the number of symbols in the pattern set. In the DFA, every state presents one transition for each character in the alphabet. Each DFA state corresponds to a set of NFA states that can be simultaneously active [28]; therefore, the number of states in a DFA equivalent to an N-state NFA can potentially be 2^N. In practice, previous work [2, 5, 29] has shown that this so-called state explosion happens only in the presence of complex patterns (typically those containing repetitions of large character sets).
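To make the per-character processing difference concrete, the following sketch (ours, not part of the toolchain; the automaton is hand-built for the three example patterns and its state numbering does not necessarily match Figure 1) simulates an NFA traversal by maintaining an explicit set of active states:

```python
# Minimal NFA simulation with an explicit active-state set (illustrative sketch).
# The automaton accepts the three example patterns a+bc, bc+ and cde.
NFA = {
    # state: list of (symbol, next_state)
    0: [('a', 1), ('b', 4), ('c', 6)],   # entry state (always active: unanchored match)
    1: [('a', 1), ('b', 2)],             # a+ then b
    2: [('c', 3)],                       # a+b then c  -> accepting state 3
    4: [('c', 5)],                       # b then c+   -> accepting state 5
    5: [('c', 5)],
    6: [('d', 7)],                       # c then d
    7: [('e', 8)],                       # c d then e  -> accepting state 8
    3: [], 8: [],
}
ACCEPTING = {3, 5, 8}

def nfa_step(active, ch):
    """One traversal step: follow every transition on `ch` from every active state."""
    nxt = {0}                            # the entry state stays active
    for s in active:
        for sym, dst in NFA[s]:
            if sym == ch:
                nxt.add(dst)
    return nxt

def match_positions(text):
    active, hits = {0}, []
    for i, ch in enumerate(text):
        active = nfa_step(active, ch)
        if active & ACCEPTING:
            hits.append(i)
    return hits

print(match_positions("aabcde"))  # prints [3, 5]: matches end after "aabc" and after "cde"
```

The per-character cost of this traversal grows with the number of simultaneously active states, whereas an equivalent DFA would take exactly one transition per input character.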

Since each DFA state corresponds to a set of simultaneously active NFA states, DFAs ensure minimal per-character processing (only one state transition is taken for each input character).

From an implementation perspective, existing regular expression matching engines can be classified into two categories: memory-based [1-12, 17, 19] and logic-based [13-16]. Within the former, the FA is stored in memory; within the latter, it is stored in combinational and sequential logic. Memory-based implementations can be deployed on various platforms (GPUs, network processors, ASICs, FPGAs); logic-based implementations typically target FPGAs. In a memory-based implementation, the design goals are the minimization of the memory size needed to store the automaton and of the memory bandwidth needed to operate it. Similarly, in a logic-based implementation the design aims at minimizing the logic utilization while allowing fast operation (that is, a high clock frequency). Existing proposals targeting DFA-based, memory-centric solutions have focused on designing compression mechanisms to reduce the DFA memory footprint and novel automata to alleviate the state explosion problem [1-9]. Despite the complexity of their design, memory-centric solutions have three advantages: fast reconfigurability, low power consumption, and scalability in the number of input streams. On the other hand, logic-centric solutions easily achieve peak worst-case performance on a single input stream, at the expense of a lack of scalability in the number of concurrent inputs.

2.2 Micron's Automata Processor Overview

Micron's Automata Processor [21] is a DRAM-based, reconfigurable accelerator that simulates NFA traversal at high speed. The AP includes three kinds of programmable elements stored in SDRAM: State Transition Elements (STE), Counter Elements (CE) and Boolean Elements (BE), which implement states/transitions, counters and logical operators between states, respectively. Each STE includes a 256-bit mask (one bit per ASCII symbol), and the symbols triggering state transitions are associated with states (and encoded into STEs) rather than with transitions. Transitions between states are then implemented through a routing matrix consisting of programmable switches, buffers, routing lines, and cross-point connections. The routing capacity is limited by trade-offs between clock rate, propagation delays and power consumption, and these constraints influence the place&route of automata onto the AP hardware. Micron's current generation of AP board (AP-D480) includes 16 or 32 chips organized into two to four ranks (8 chips per rank), and its design can scale up to 48 chips. Each AP chip consists of two half-cores. There are no routes either between half-cores or across chips, which implies that NFA transitions across half-cores and chips are not possible. Programmable elements are organized in blocks: each block consists of 16 rows, where a row includes eight groups of two STEs and one special-purpose element (CE or BE). Each chip contains a total of 49,152 STEs, 768 CEs and 2,304 BEs, organized in 192 blocks and residing equally in both half-cores. Current boards allow up to 6,144 elements per chip to be configured as report elements. AP automata can be described in ANML (an XML-based language). Recently proposed high-level programming languages for the AP are mapped and compiled into ANML [30]. Micron's SDK includes a toolchain that parses ANML designs, compiles them into internal objects consisting of subgraphs, places and routes these subgraphs onto the AP hardware, and finally generates a binary image that can be used to program the AP memory and routing matrix.
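As a rough software model of what an STE computes during traversal (our sketch, not Micron's implementation; the "enabled on every input" behavior of start elements is an assumption of the model), each state carries a 256-entry symbol mask and, when enabled and matched, activates its successors:

```python
# Illustrative software model of STE-style (homogeneous) NFA stepping: match symbols
# live on states rather than on transitions.
class STE:
    def __init__(self, symbols, start=False, report=False):
        self.mask = [False] * 256            # 256-bit symbol mask, one bit per 8-bit code
        for s in symbols:
            self.mask[ord(s)] = True
        self.next = []                       # indices of STEs enabled when this one fires
        self.start = start                   # modeled as "enabled on every input symbol"
        self.report = report

def ap_step(states, enabled, ch):
    """Process one input symbol: every enabled STE whose mask contains `ch` fires."""
    code = ord(ch)
    fired = {i for i in enabled if states[i].mask[code]}
    nxt = {i for i, s in enumerate(states) if s.start}   # start STEs stay enabled
    for i in fired:
        nxt.update(states[i].next)
    reports = [i for i in fired if states[i].report]
    return nxt, reports
```

The AP evaluates all enabled elements in parallel, which is why it sustains one input character per clock cycle regardless of the number of active states; the sequential loop above is only a functional model.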
Once the AP has been programmed, it can simulate the NFA traversal. AP chips can be grouped into logical cores of 2, 4 or 8 chips, each processing a stream of 8-bit input characters [25]. The AP nominally operates at a 133 MHz frequency and, in the absence of matches, it processes one input character per clock cycle from all input streams (at 8 bits per character, this corresponds to a peak of roughly 1 Gbps per input stream). Once matches occur, the AP generates reporting events in vector format and stores them in an output buffer; reporting matches to the host system requires from 91 to 291 clock cycles.

3 TOOLCHAIN

3.1 Overall design

Figure 2: Our toolchain

Figure 2 shows the toolchain designed to deploy ANML specifications on GPU, FPGA and Micron's AP. In the figure, grey boxes represent the software components that we have designed and implemented. The last two modules leading to FPGA and AP are the Xilinx and Micron software development kits, used for the final synthesis/compilation, map, and place&route on these two devices. The input to the toolchain is an ANML file that contains one or more automata networks (each including one or more NFAs). We do not impose any constraints on these networks: in other words, they do not need to be designed to fit a particular device or be optimized for it. Once parsed, these networks are stored in our toolchain using an internal representation for later processing and optimization. We distinguish two categories of optimizations: automata-specific and platform-specific. Since the GPU, FPGA and AP are used as NFA traversal accelerators, optimizations to the automaton apply to all platforms. In our previous work [14], we described several NFA optimizations (state reduction, alphabet compression and software striding) and put them to practice on FPGA; these optimizations apply to GPUs and the AP as well. Automata-specific optimizations can be selectively enabled and disabled. Platform-specific optimizations are related to the way the NFA is encoded for the particular target device; these optimizations include compact and efficient memory encodings, logic utilization, and striding mechanisms that are specific to a particular hardware platform. Since the internals of the operation of the AP hardware and its software stack (including the compilation, map and place&route processes) are proprietary, AP-specific optimizations are

deferred to the AP SDK tools (the last phase of the toolchain). The partitioning step, which takes a potentially large network and breaks it into multiple NFA partitions so that each of them can fit the target hardware, is performed after the automata-specific optimization step. This allows partitioning to be done on an already optimized NFA. Our partitioning algorithm is platform-independent, but its configuration depends on the target platform. The code and configuration generation step produces the files required for the final deployment of the automata network on the hardware. For GPUs, all that is needed is a configuration file that includes the information necessary to load the NFA partitions into memory, and a header file with the definition of the boolean connectors in the ANML specification. FPGAs are configured through a Verilog file describing the NFA network and its interface. The AP is configured through an ANML file; this output file differs from the input file in that it contains a partitioned and optimized automata network.

3.2 GPU implementation

We reuse and extend iNFAnt [10], an NFA-traversal engine for GPUs. iNFAnt stores the NFA in device memory, and encodes the transition table as a set of (source, destination) pairs indexed by the input character. To allow efficient execution, iNFAnt stores the set of active states in shared memory in bit-vector form. For each input character, iNFAnt retrieves from memory all the transitions on that symbol and, if their source state is active, the engine updates the active state vector with the destination state information. In iNFAnt, each thread block is assigned an input stream, and threads within a block process the state transitions and update the state vector cooperatively. We extended iNFAnt with the following functionalities:
- Support for multiple NFA partitions. We map each NFA partition to a thread block, allowing multiple blocks to process the same input stream on different partitions. The transition lists corresponding to different partitions are laid out sequentially, and an indexing array maps each partition to the proper set of thread blocks, each operating on a different input stream.
- Traversal kernels based on a compressed sparse row (CSR) layout. We consider an alternative memory layout where transitions, represented as (input symbol, destination) pairs, are indexed by the source state. For each input symbol, this layout allows processing only the transitions that originate from active states. We store the identifiers of the active states in a queue in global memory. We consider two variants of this kernel: CSR-state and CSR-tx, the former mapping active states to threads, and the latter mapping outgoing transitions from active states to threads.
- Support for counters and boolean elements. We associate a special state with each counter and boolean element, and store these special states at the end of the state vector. The activation of special states triggers code implementing the operation of the particular counter or boolean element. Boolean operators are also associated with combinational code that is stored in an automatically generated header file.
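The layout difference between the iNFAnt kernel and the CSR kernels can be sketched as follows (illustrative Python rather than the actual CUDA kernels; the toy transition tables reuse the small example automaton sketched earlier):

```python
# Illustrative comparison of the two state-transition layouts (not the CUDA kernels).
# iNFAnt-style: transitions grouped by input symbol -> scan all transitions labeled `ch`.
by_symbol = {            # symbol -> list of (source, destination)
    'a': [(0, 1), (1, 1)],
    'b': [(0, 4), (1, 2)],
    'c': [(0, 6), (2, 3), (4, 5), (5, 5)],
    'd': [(6, 7)],
    'e': [(7, 8)],
}

def step_infant(active, ch):
    nxt = {0}
    for src, dst in by_symbol.get(ch, []):   # touches every transition on `ch`,
        if src in active:                    # even those leaving inactive states
            nxt.add(dst)
    return nxt

# CSR-style: transitions grouped by source state -> scan only the rows of active states.
by_state = {             # source -> list of (symbol, destination); one CSR row per state
    0: [('a', 1), ('b', 4), ('c', 6)],
    1: [('a', 1), ('b', 2)],
    2: [('c', 3)], 4: [('c', 5)], 5: [('c', 5)], 6: [('d', 7)], 7: [('e', 8)],
}

def step_csr(active, ch):
    nxt = {0}
    for src in active:                          # CSR-state maps each active state to a thread;
        for sym, dst in by_state.get(src, []):  # CSR-tx would map each such transition instead
            if sym == ch:
                nxt.add(dst)
    return nxt
```

When few states are active, the CSR layout avoids scanning the potentially long per-symbol transition lists, which is consistent with the behavior reported in Section 5.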
3.3 FPGA implementation

On FPGA, NFA processing can be realized in two ways: either by implementing a traversal engine that accesses the NFA stored in memory, or by directly encoding the NFA in logic. Most logic-based NFA implementations are based on the one-hot encoding scheme [13], in which states are represented as flip-flops while transitions are implemented by AND-ing and OR-ing the outputs of the flip-flops with the decoded input character: for instance, a state reachable from states p and q on character class C is implemented as a flip-flop whose next value is (p OR q) AND match_C(input). For example, Figure 3 shows the one-hot encoding representation of the NFA accepting regular expression ab+[cd]e.

Figure 3: NFA accepting regular expression ab+[cd]e and corresponding one-hot encoding representation.

The main advantage of this scheme is that it limits the traversal time to one clock cycle per input character, independent of the number of states that are active (a property shared by Micron's AP). On the other hand, this implementation suffers from two limitations: first, updating the NFA requires reprogramming the device; second, supporting multiple input streams requires logic replication. The pros and cons of a memory-based FPGA design are comparable to those of a GPU solution: easy support for multiple input streams at the cost of irregular and unpredictable memory access patterns, leading to dataset-dependent performance. In this paper we use the optimized logic-based implementation that we described in our previous work [14], and extend it to support counters and boolean elements (a trivial extension).

3.4 Automata-specific optimizations

Our toolchain includes three automata-specific optimizations: state reduction, alphabet compression and software striding [14]. Here, we briefly mention their effect on the considered platforms. State reduction (which merges duplicate NFA paths) reduces the memory requirements on GPU and AP, and the logic requirements on FPGA. In addition, it reduces the number of states that can be active in parallel, which on GPU is beneficial to throughput. Alphabet compression (which consolidates the alphabet based on the symbols appearing on the NFA transitions) reduces the wiring and LUT utilization on FPGA. However, because the AP stores a 256-bit mask in each STE, this optimization does not benefit the AP unless combined with software striding. Software striding (which allows processing multiple characters in one step) can be beneficial on all platforms if combined with alphabet reduction. This technique is applicable to the AP only if the alphabet generated by combining alphabet reduction and software striding does not exceed 256 symbols. GPUs and FPGAs also offer platform-specific striding schemes [6, 10, 15], which we have included in our toolchain.
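To illustrate the 256-symbol constraint, the following sketch (ours; the character classes are invented for illustration, whereas the real classes are derived from the symbols appearing on the optimized NFA's transitions) compresses the alphabet into equivalence classes and checks whether a 2-character stride still fits an STE's 256-entry mask:

```python
# Illustrative alphabet compression followed by 2-character striding (our sketch).
transition_classes = [{'a'}, {'b'}, {'c', 'd'}, {'e'}]   # character classes used by a toy NFA

def compress_alphabet(classes, alphabet_size=256):
    """Map every byte value to an equivalence-class id; unused symbols share class 0."""
    mapping = {}
    for cid, cls in enumerate(classes, start=1):
        for ch in cls:
            mapping[ord(ch)] = cid
    return {sym: mapping.get(sym, 0) for sym in range(alphabet_size)}

compressed = compress_alphabet(transition_classes)
num_classes = len(set(compressed.values()))     # 5: four used classes plus "everything else"

# With 2-striding, each new input symbol is a pair of compressed classes.
strided_alphabet = num_classes ** 2
print(num_classes, strided_alphabet, strided_alphabet <= 256)
# prints: 5 25 True, i.e. the strided alphabet still fits an STE's 256-bit mask, so the
# combined optimization remains applicable to the AP for this toy case.
```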

Figure 4: Example of application of our coloring scheme (Nmax = 8). (a) Reference NFA; (b) first initial coloring step; (c) second initial coloring step; (d) third initial coloring step; (e) replication reduction step; (f) final consolidation.

3.5 Partitioning criteria

An NFA must be partitioned if it exceeds the resources available on a particular device. Here, we indicate the platform-specific partitioning criteria we use; in Section 4, we describe our proposed partitioning algorithm.

GPU: GPU partitioning is required if the shared or global memory capacity is exceeded, or if the state identifier space is exhausted. In this paper, we use 16-bit state identifiers, leading to a maximum of 64k states per NFA partition. This constraint is more restrictive than those on the global and shared memory capacity (and, due to thread-block concurrency, is not a limiting factor on performance; see Section 5).

FPGA: The logic design used stores states in flip-flops and transitions in LUTs. We experimentally found flip-flops to be the bottleneck resource. Thus, we perform NFA partitioning when the number of NFA states exceeds that of the available flip-flops.

AP: The AP does not allow transitions across half-cores, and has a limited number of STEs, Counter Elements and Boolean Elements per half-core (see Section 2.2). Thus, the AP NFA partitioning criterion is based on these constraints.

4 NFA PARTITIONING ALGORITHM

In this section, we describe our NFA partitioning algorithm. For the sake of simplicity, we discuss the algorithm on traditional NFAs: its extension to counters and boolean elements is straightforward. In order to preserve functional equivalence, NFA partitioning requires state replication. For example, let us assume we break the NFA of Figure 4(a) into two partitions to be deployed and operated on two devices: one partition containing states from 0 to 16, and the other containing states from 17 to 24. In order to maintain functional equivalence with the original NFA, the entry state 0, which is shared by the patterns matched in both partitions, must be replicated into the second partition. In general, very large NFAs may require replication of sets of states shared by several patterns. The goal of our partitioning algorithm is to split the NFA into a small number of balanced partitions, while minimizing the required state replications. In particular, given a threshold Nmax on the number of states that can be accommodated on a particular device or hardware component, the algorithm must split the NFA into as few partitions as possible, each with size not exceeding Nmax. Balanced partitions allow load balancing within (for GPUs and Micron's AP) and across (for FPGAs) devices, which ultimately has a positive effect on throughput. It is worth noting that existing partitioning schemes for generic graphs [31] aim to minimize the size of the cut (the number of inter-partition transitions) but, when applied to NFAs, they do not necessarily minimize the number of state replications required to preserve functional equivalence with the unpartitioned NFA. Hence the need for a partitioning scheme tailored to NFAs. We propose an algorithm that colors the NFA so that each color represents a partition; states assigned multiple colors are shared across partitions and must be replicated in each of these partitions. In order to meet the requirements above, the algorithm must limit the number of colors and of states with multiple colors, while allowing each color to appear in up to Nmax states. In the following, we call color size the number of states assigned a particular color. Our algorithm operates in two phases: initial coloring and color consolidation.
In the initial coloring phase, the NFA is traversed from the entry state and recursively colored until the size of each color does not exceed Nmax. The color consolidation phase consolidates multiple colors into one while keeping their size below the given threshold. We note that sets of states connected by cyclic transitions (e.g., states 2 and 3 in Figure 4(a)) cannot be separated into multiple partitions. We recall that, for partitions to be independent, inter-partition activations (that is, cross-partition transitions) must be avoided. As a consequence, a state belonging to multiple partitions must be replicated along with all the states connected to it in a cyclic fashion. Thus, we group states that are cyclically interconnected into super-states, and we handle all the states in a super-state together. For example, states 2 and 3 of Figure 4(a) form super-state {2, 3} and are handled as a single state.
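Grouping cyclically interconnected states is equivalent to collapsing the strongly connected components of the NFA's transition graph; a compact sketch (ours, using Kosaraju's algorithm over a plain adjacency-list representation) is shown below:

```python
# Illustrative super-state construction: collapse strongly connected components (SCCs)
# of the NFA's transition graph, so each multi-state SCC becomes one super-state.
def super_states(adj):
    """adj: {state: iterable_of_successors}. Returns the list of SCCs (sets of states)."""
    visited, order = set(), []

    def dfs(u, graph, out):
        visited.add(u)
        stack = [(u, iter(graph.get(u, ())))]
        while stack:
            node, it = stack[-1]
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(graph.get(v, ()))))
                    break
            else:                      # all successors explored: record finish order
                stack.pop()
                out.append(node)

    for u in adj:                      # first pass: compute finishing order
        if u not in visited:
            dfs(u, adj, order)

    radj = {}                          # reversed transition graph
    for u, succs in adj.items():
        radj.setdefault(u, set())
        for v in succs:
            radj.setdefault(v, set()).add(u)

    visited.clear()
    sccs = []
    for u in reversed(order):          # second pass on the reversed graph
        if u not in visited:
            comp = []
            dfs(u, radj, comp)
            sccs.append(set(comp))
    return sccs

# The cycle between states 2 and 3 collapses into super-state {2, 3}.
print(super_states({0: {1}, 1: {2}, 2: {3}, 3: {2, 4}, 4: set()}))
```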

Table 1: Dataset characteristics and traversal information (ranges correspond to traces with pforw = 0.5 and pforw = 0.9). Columns: dataset type and name; NFA characteristics (# states, # transitions, # ANML states); # partitions (GPU, FPGA, AP); traversal information (avg. active set, max. active set, # matches, % of inputs with matches).

In order to operate, the algorithm requires super-states to include fewer than Nmax states. If this is not the case, the NFA cannot be split into independent partitions. In the presence of dependent partitions, multiple NFA traversals are required to handle inter-partition activations. Fortunately, NFAs originated from regular expression datasets tend to have only a few super-states of small size. This is because backward-directed transitions in NFAs originate from sub-pattern repetitions within regular expressions (for example, sub-pattern (cd)* in Figure 4(a), where string cd can be repeated zero or more times). Sub-pattern repetitions are rare in real-world datasets, and are rarely shared by a large number of patterns. We now detail the operation of the two phases of the algorithm, and illustrate them in Figure 4. In the example, we assume that the threshold Nmax is equal to 8.

Initial coloring. The initial coloring procedure starts by assigning distinct colors to the states connected to the entry state (or to the super-state to which it belongs). This is illustrated in Figure 4(b), where the children of the entry state 0 are colored brown, green, yellow, pink, white, blue and orange. The colors are propagated to all the connected states following the transitions. The entry state is then assigned all the colors of its children. As can be seen, this leads to some states (states 11-12, 14-16, 19-21, besides state 0) being assigned multiple colors. This operation must be repeated recursively on all generated NFA partitions until their size does not exceed the threshold Nmax. As can be seen, after the first coloring step the brown color has size 12 (states 0-9 and 11-12). Therefore, the coloring procedure is repeated starting from state 1. This causes color brown to be split into colors red and violet, which are again propagated down to the terminal states of the NFA (Figure 4(c)). Since color violet has size 10 (including state 0), the algorithm invokes one additional recursive step on super-state {2,3}, causing color violet to be split into colors cyan and grey (Figure 4(d)). Since the largest color (grey) now has size 8 (equal to Nmax), the initial coloring phase is terminated.

Color consolidation. While respecting the constraint on the maximum partition size, the partitioning generated by the initial coloring step has two limitations: it includes small and unbalanced partitions, and it leads to significant state replication. In the example, states 2, 3, 11, 14-16 must be replicated once, states 1 and 12 must be replicated twice, and state 0 must be replicated 8 times. The color consolidation phase aims to combine different colors into one so as to increase the partition size, decrease the number of partitions and the number of state replications required, and achieve more balanced partitions. This phase is broken down into two steps: replication reduction and final consolidation.
The first step aims to reduce the number of state replications required by merging colors. To determine which colors to consolidate, we sort pairs of colors in descending order according to the number of state replications that their consolidation would save. In the example, cyan/grey, yellow/pink and white/blue would each save 3 state replications, green/red would save 2, and red/yellow, green/yellow, red/cyan and red/grey would save only 1. We then consider all pair-wise consolidation opportunities in order, and merge two colors only if their merging does not violate the partition size constraint. In the example, we consolidate yellow+pink into yellow, white+blue into white, and green+red into green. Figure 4(e) shows the result of the replication reduction step. In the final consolidation step, we look for opportunities to consolidate colors according to their size. To this end, we first sort the colors in descending order by size, and then traverse the list and consolidate each color with the next color in the list that does not lead to violating the partition size constraint (if such a color exists). In the example, colors orange and green are consolidated into orange. Figure 4(f) shows the final coloring, which leads to 5 partitions: two of size 8 (grey and orange) and three of size 6 (cyan, yellow and white).
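The consolidation phase can be sketched as the following greedy procedure (illustrative Python; the data structures and the savings computation are simplified with respect to the actual toolchain, and super-state and special-element constraints are omitted):

```python
# Illustrative sketch of the color consolidation phase (greedy, simplified).
# colors: {color_id: set_of_states}; a state replicated across colors appears in several sets.
def consolidate(colors, n_max):
    # Step 1: replication reduction. Sort color pairs by how many state replications their
    # merge would save (the states they share), and merge feasible pairs in that order.
    pairs = sorted(((a, b) for a in colors for b in colors if a < b),
                   key=lambda p: len(colors[p[0]] & colors[p[1]]), reverse=True)
    for a, b in pairs:
        if a in colors and b in colors and colors[a] & colors[b] \
                and len(colors[a] | colors[b]) <= n_max:
            colors[a] |= colors.pop(b)

    # Step 2: final consolidation. Sort the surviving colors by size and merge each one
    # with the next color in the list whose union still respects the size threshold.
    order = sorted(colors, key=lambda c: len(colors[c]), reverse=True)
    for i, a in enumerate(order):
        if a not in colors:
            continue
        for b in order[i + 1:]:
            if b in colors and len(colors[a] | colors[b]) <= n_max:
                colors[a] |= colors.pop(b)
                break
    return colors
```

The real implementation additionally respects super-states and the per-platform limits on counters, boolean and reporting elements, which this sketch omits.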

5 EXPERIMENTAL EVALUATION

5.1 Hardware platform

We conducted almost all our experiments on a machine equipped with a dual 6-core Intel Xeon 2.66GHz and 64GB of memory, running CentOS 6.4. Since some of our AP syntheses run out of memory on that machine, for our AP experiments we used a server with similar hardware settings but equipped with 256GB of memory. For our GPU experiments we used an Nvidia Titan X GPU (Maxwell architecture), equipped with 12GB of global memory and 24 streaming multiprocessors (SMs), each including 128 cores and 96KB of shared memory. We used CUDA 7.0. For our FPGA experiments we used a Xilinx XC6VLX130T device (Virtex-6 family), which includes 20,000 slices (for a total of 160,000 flip-flops and 80,000 LUTs). We used the Xilinx ISE Design Suite v13.2 to perform synthesis, mapping and place&route of our HDL designs. This FPGA device was chosen because it is in the same price range (~$1,200) as our GPU. For AP experiments, we refer to the architecture of a 32-chip AP-D480. Since Micron's AP hardware is not yet available on the market, we do not have pricing information for it. For the AP, we used the AP SDK to collect resource utilization and preprocessing data, and performed throughput projections using the nominal operating frequency (more details in Section 5.3).

Table 2: Resource utilization for GPU (ranges correspond to different numbers of streams) and FPGA (ranges correspond to the minimum and maximum values across partitions). Columns: dataset type and name; # GPU devices; GPU memory utilization (shared (KB) and global (MB), for the iNFAnt and CSR layouts); # FPGA devices; FPGA % utilization (FF, LUT, slice).

5.2 Datasets

We selected datasets that allow comparing the three platforms on different application domains and on NFAs with varying characteristics in terms of number of states and transitions, alphabet size, connectivity and depth. To this end, we used three types of datasets: small NIDS, bioinformatics, and synthetic. Recently proposed benchmark suites for automata processing [32] are not meant for large-scale analysis (they include NFAs with up to about 100k states). Table 1 (columns 3-5) summarizes the characteristics of the NFAs for the considered datasets.

Small NIDS datasets (snort538 and l7-filter) are small network intrusion detection datasets that include 538 and 116 regular expressions, respectively (see [10] for more details).

Bioinformatics datasets (ngene_kk) consist of a set of Hamming distance automata used to address a motif-finding problem [33]. The problem requires identifying all the substrings of length k that appear on multiple genes within Hamming distance d, and can be found in a region of the gene of length l. Due to space limitations, here we show only the results for n genes from a yeast genome of about 5000 genes, with n = {10, 100}, k = {8, 12, 16, 20}, l = 500, d = 2 and a 4-symbol alphabet (A, C, G, T). A Hamming distance NFA has (k+1)(d+1) - d(d+1)/2 states, and each gene region of length l leads to (l-k+1) of these NFAs. The NFAs in Table 1 (used on all three platforms) have been state-reduced. However, previous work [22, 30] has shown that, on the AP, preprocessing time can be significantly reduced if NFAs with a known structure are precompiled. Thus, on the AP we also use a non-state-reduced variant of these bioinformatics datasets (see Table 5), leading to networks of n(l-k+1) small NFAs (each with (2d+1)k - 2d STEs) with a fixed topology.
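As a quick sanity check of these sizing formulas (our arithmetic, based on the expressions above rather than on the table data), the snippet below evaluates them for the smallest bioinformatics configuration:

```python
# Worked example of the dataset sizing formulas above (our arithmetic, not table data).
k, d, l, n = 8, 2, 500, 10                               # smallest bioinformatics configuration

states_per_nfa = (k + 1) * (d + 1) - d * (d + 1) // 2    # Hamming-distance NFA states
nfas_total = n * (l - k + 1)                             # one NFA per window position per gene
stes_fixed_topology = (2 * d + 1) * k - 2 * d            # non-state-reduced AP variant

print(states_per_nfa)        # 24 states per (k=8, d=2) Hamming NFA
print(nfas_total)            # 4930 NFAs for the 10-gene dataset
print(stes_fixed_topology)   # 36 STEs per NFA in the fixed-topology variant
```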
Synthetic automata exhibit the structure of NFAs accepting sets of regular expressions with shared prefixes: large state out-degrees in the proximity of the entry state, and low state out-degrees as we move deeper into the NFA. Our synthetic NFAs have a configurable number of states, alphabet size, entry state out-degree, out-degree decrease factor γ (the out-degree decreases as γ^depth), and frequency of wildcards, character sets and their repetitions. We set these parameters so as to generate 800k-state NFAs with two alphabet sizes (64 and 256) and two structures (deep and shallow, about 180 and 16 levels deep, respectively). Whenever required, we partition these NFAs with the algorithm described in Section 4. We recall that the partitioning threshold is platform-specific (Section 3.5). This leads to the number of partitions shown in Table 1 (columns 6-8).

In order to simulate the NFA traversal, we use two kinds of input streams. For bioinformatics datasets, we generate traces of length 500,000 (for 1,000 genes) by randomly selecting symbols from the {A, C, G, T} alphabet. For NIDS and synthetic datasets, we generate 256k-character traces through our trace generator [34], setting the probability of moving deeper into the NFA (pforw) to 0.5 and 0.9. The traversal characteristics (average and maximum number of active states per input character, and number and frequency of matches) are reported in Table 1 (columns 9-12).
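The role of pforw can be illustrated with the following sketch of a trace generator (ours; it conveys the idea rather than reproducing the generator of [34]): at each step, with probability pforw the next character is drawn from the symbols that advance some currently active state, and otherwise it is drawn uniformly from the alphabet.

```python
import random

# Illustrative pforw-style trace generator (a sketch of the idea, not the tool from [34]).
def generate_trace(nfa, alphabet, length, p_forw, seed=0):
    rng = random.Random(seed)
    active, trace = {0}, []
    for _ in range(length):
        # symbols that would advance at least one currently active state
        forward = {sym for s in active for sym, _ in nfa.get(s, [])}
        if forward and rng.random() < p_forw:
            ch = rng.choice(sorted(forward))
        else:
            ch = rng.choice(alphabet)
        trace.append(ch)
        # advance the active set exactly as the traversal engine would
        nxt = {0}
        for s in active:
            for sym, dst in nfa.get(s, []):
                if sym == ch:
                    nxt.add(dst)
        active = nxt
    return ''.join(trace)

# Example with the toy automaton used earlier; higher p_forw yields deeper traversals and
# larger active sets, which is what the 0.5 / 0.9 traces are meant to exercise.
toy = {0: [('a', 1), ('b', 4), ('c', 6)], 1: [('a', 1), ('b', 2)], 2: [('c', 3)],
       4: [('c', 5)], 5: [('c', 5)], 6: [('d', 7)], 7: [('e', 8)]}
print(generate_trace(toy, list("abcde"), 32, p_forw=0.9))
```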

5.3 Results

Resource utilization is reported in Table 2 for GPU and FPGA, and in Table 5 (columns 7-12) for the AP. For GPU, we recall that the NFA is stored in global memory, while the active state information (encoded in a bit vector) is stored in shared memory. In the CSR case, the active state information is also stored (in queue format) in global memory, and therefore the global memory requirement increases with the number of thread blocks run. However, as can be seen in Table 2, the global memory utilization is very limited even for the CSR format, and even the largest dataset occupies only up to 133MB of the 12GB of global memory. We recall that NFA partitioning is driven by the use of 16-bit state identifiers, and shared memory stores two bitmaps indicating the states active at the beginning and at the end of each traversal step. Therefore, the use of partitions with at most 64k states limits the per-block shared memory utilization to 16KB in the worst case, allowing at least 6 blocks to reside on an SM and hide each other's memory latencies.

For FPGA, to facilitate the place&route process, we size the partitions so as to use up to 70% of the flip-flop capacity. Since the considered device has twice as many flip-flops as LUTs, on most experiments this setting leads to near-full slice utilization.

For the AP (Table 5), we report both the ideal utilization (the number of blocks and AP cores that a dataset would require based on the number of its STEs and reporting elements) and the utilization numbers reported by the AP's SDK (real utilization). The utilization efficiency in column 11 is the ratio between the ideal and real block utilization. As can be seen, due to the place&route constraints on the routing matrix, the real utilization is significantly higher than the ideal one. Note that shallow synthetic datasets have significantly lower utilization efficiency (~20%) than deep ones (>80%): this is because the node out-degree of non-terminal states is large for shallow and low for deep datasets, making the former much harder to route. Due to the generally low utilization efficiency, we partitioned all the state-reduced NFAs so that each partition would require 50% (rather than the whole) of the half-core capacity. This led to the number of AP partitions shown in Table 1 (column 8). In addition, we experienced that the AP SDK tools run out of memory when processing large datasets. To avoid this, for state-reduced NFAs we grouped partitions into batches per device (depending on the transition density of the dataset), and we ran the AP SDK on one batch at a time. In the table, for each state-reduced dataset we report the cumulative results over all batches. In the case of large fixed-topology datasets (100gene*), which consist of many small NFAs with the same topology, we size each batch so as to use all 32 cores on the AP. Since the place&route algorithm used by the SDK is proprietary, this was a trial-and-error process. For these datasets, we report the number of batches (which corresponds to the number of AP boards required) and the per-batch data. As can be seen, for small k (i.e., small Hamming distance NFAs) the place&route is easier and the number of STEs per batch and the utilization efficiency are higher. Since larger Hamming distance NFAs are harder to place, the utilization efficiency decreases as k increases.

Table 3: Traversal throughput (ranges correspond to different numbers of streams for GPU and to different partitions for FPGA). Columns: dataset type and name; GPU (# streams; iNFAnt, CSR-state and CSR-tx throughput (Mbps)); FPGA throughput (Gbps).

Traversal throughput is computed using the following formulas, which assume 8-bit inputs.
We assume that matches are reported every 64K inputs (the maximum IP packet length) for NIDS datasets, every 500 inputs (the length of the relevant portion of a gene) for bioinformatics datasets, and every 1000 inputs for synthetic datasets (Ninputs). For FPGA, we use the worst-case, post-place&route operating frequency reported by the Xilinx tools. The number of cycles required to report the matches (Noutput_processing_cycles) is equal to the ratio between the number of matching states in the NFA and the number of output pins on the FPGA device. For the AP, we performed estimates based on the 133 MHz nominal operating frequency and the 291 clock cycle output processing time.

Table 4: Preprocessing overhead (for large datasets, we show minimum and maximum per-partition data). Columns: dataset type and name; GPU (parsing (sec), memory layout generation (sec), loading to memory (ms)); FPGA (parsing (sec), Verilog generation (sec), synthesis + place&route (min)).
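The formulas themselves did not survive in this transcription; the model below is our reconstruction from the quantities defined above (not necessarily the exact expression used in the paper), together with the per-stream AP estimate it yields for a 500-input reporting interval.

```python
# Reconstructed throughput model (our reading of the description above, not necessarily
# the paper's exact formula): one 8-bit input is consumed per cycle, and after every
# N_inputs characters the device spends N_report cycles delivering the match report.
def throughput_bps(freq_hz, n_inputs, n_report_cycles):
    cycles_per_window = n_inputs + n_report_cycles
    return 8 * n_inputs * freq_hz / cycles_per_window

# AP estimate for the bioinformatics traces: 133 MHz, a report every 500 inputs,
# 291 cycles per report.
print(throughput_bps(133e6, 500, 291) / 1e6)   # roughly 670 Mbps per input stream
```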

Table 5: AP results. Columns: dataset type, variant (state-reduced or fixed-topology) and name; ANML-NFA characteristics (# batches, # states/STEs, # start states, # report states); ideal utilization (# cores, # blocks); resource utilization from SDK profiling (# cores, # blocks, % utilization efficiency); # AP boards; SDK preprocessing time (place&route (sec), compilation (sec), total (min)); throughput per device (Mbps).

Throughput data are shown in Table 3 for GPU and FPGA and in Table 5 for the AP. As can be seen, while able to fit even large datasets on a single device, the GPU reports the lowest throughput data. In the GPU experiments, we configured the thread-block size to 256 and 32 for bioinformatics and NIDS/synthetic datasets, respectively. This is because we expected bioinformatics datasets to have larger active sets (as confirmed in Table 1). We recall that the number of thread blocks run is equal to the product of the number of partitions and the number of input streams processed. To avoid idle SMs and ensure processing of all partitions, we set the number of blocks to be at least equal to the number of SMs and of partitions. We then increased the number of blocks (and, as a consequence, of streams) until noticeable throughput improvements could no longer be observed. We make two observations. First, GPU resources are better utilized when processing a large number of input streams, leading to better throughput. Second, while the iNFAnt kernel greatly outperforms the CSR kernels on small datasets, the CSR-state kernel reports better performance on bioinformatics datasets with large k. On datasets with a large number of partitions, iNFAnt is penalized by looping through a large number of transitions that originate from inactive states.

Since large datasets require multiple FPGAs and AP boards (or multiple iterations through the same board), for FPGAs and the AP we report the traversal throughput per device. Since for most partitions the slice capacity is fully utilized, the number of FPGA devices required is equal to the number of FPGA partitions (Table 2, column 7), while the number of AP boards required is reported in Table 5 (column 12). For small datasets requiring only a small portion of the device, both platforms can run multiple streams by replicating the NFA. In the case of FPGA, to utilize ~70% of the slice capacity, we run 6 and 4 streams for l7-filter and snort538, respectively. In the case of the AP, we consider that chips can be grouped into logical cores processing streams in parallel (Section 2.2). As can be seen, on large datasets (100gene* and synthetic) FPGAs outperform the AP by up to a factor of ~2.6x, while requiring 2-3x more devices than the AP.

Preprocessing cost: In this section, we focus on the platform-specific preprocessing time. The NFA optimization and partitioning steps, common to all platforms, take from 3 to 249 sec (smallest to largest dataset). After these two steps, we save the NFA to file. As can be seen from Table 4, the GPU preprocessing is mostly related to the parsing of the NFA partition files, and varies from 5 sec to about 4.5 min. For FPGA, synthesis and place&route account for most of the preprocessing time, and preprocessing a large partition may require up to 165 minutes (leading to several hours for the full datasets).
Similar preprocessing times are observed on the AP (for example, the preprocessing time for the shallow-256char dataset is about 12 hours). In addition, the preprocessing time increases with the transition density (deep datasets are preprocessed much faster than shallow ones), whereas the alphabet size has a lesser effect (since on the AP transition symbols are associated with STEs and stored in memory). As mentioned, the AP preprocessing time can be reduced for datasets with a known topology (i.e., fixed-topology datasets) by pre-compilation. However, finding a configuration that fully uses the AP is a trial-and-error process.

Overall comparison: Figure 5 summarizes the results (note that throughput and preprocessing time are on a logarithmic scale). As can be seen, FPGAs provide the best traversal throughputs (up to ~2.6x those of the AP) at the cost of significant preprocessing times (~hours); GPUs deliver modest traversal throughputs (~Mbps) but incur limited preprocessing time (~seconds to minutes) and can accommodate large datasets on a single device; Micron's AP is an intermediate choice between FPGAs and GPUs, and is most suited for applications that use datasets consisting of many small NFAs with a fixed topology.

Figure 5: Traversal throughput, device utilization (in terms of number of devices) and overall preprocessing time.

Power consumption: While the AP is not yet on the market, its design targets a worst-case power consumption of 4W per chip [23]. Due to lack of space, here we report power data only for a medium-size dataset (100genes_12k). Xilinx's Power Analyzer estimates the FPGA power consumption to be between 2.09W and 2.36W on different partitions. In contrast, GPU experiments on Texas State's Marcher system report an average GPU power consumption of W and W on the best and worst implementations/kernel configurations, respectively.

6 CONCLUSION

To summarize, large datasets with tens of thousands of states or more must be partitioned in order to be deployed on GPUs, FPGAs and Micron's AP. While for GPUs partitioning is required only to effectively use the GPU resources (e.g., on-chip memory), FPGAs and the AP require splitting large NFAs across multiple devices. On these large datasets, logic-based FPGA designs can outperform the AP by a factor of ~2x, while requiring 2-3x more devices to accommodate the dataset; GPUs underperform FPGAs by up to a factor of 900x. GPUs in general deliver low performance on a single input stream, but their cumulative throughput scales up to thousands of input streams. GPUs offer the advantage of limited preprocessing time (up to a few minutes on million-state NFAs), while FPGAs and the AP can take several hours to preprocess the same datasets. Precompiling the NFA can hide the AP's preprocessing time, but this is possible only if the topology of the NFA is known a priori (e.g., Hamming or Levenshtein distance NFAs). Finding an NFA configuration that uses all 32 AP cores is a trial-and-error process that can require about an hour per experiment. Finally, due to routing constraints, the AP's SDK can keep the utilization efficiency as low as 20%, while the FPGA utilization is more predictable given the NFA size.

ACKNOWLEDGMENTS

This work has been supported by NSF awards CNS and CCF, and by the Institute for Critical Technology and Applied Science (ICTAS).

REFERENCES

[1] S. Kumar et al., Algorithms to accelerate multiple regular expressions matching for deep packet inspection, in Proc. of SIGCOMM.
[2] S. Kumar et al., Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia, in Proc. of ANCS.
[3] S. Kumar et al., Advanced algorithms for fast and scalable deep packet inspection, in Proc. of ANCS.
[4] M. Becchi and P. Crowley, An improved algorithm to accelerate regular expression evaluation, in Proc. of ANCS.
[5] M. Becchi and P. Crowley, A hybrid finite automaton for practical deep packet inspection, in Proc. of CoNEXT.
[6] M. Becchi and P. Crowley, Extending finite automata to efficiently match Perl-compatible regular expressions, in Proc. of CoNEXT.
[7] R. Smith et al., Deflating the big bang: fast and scalable deep packet inspection with extended finite automata, in Proc. of SIGCOMM.
[8] A. X. Liu and E. Torng, An overlay automata approach to regular expression matching, in Proc. of INFOCOM.
[9] X. Yu et al., Revisiting State Blow-up: Automatically Building Augmented-FA while Preserving Functional Equivalence, JSAC.
[10] N. Cascarano et al., iNFAnt: NFA pattern matching on GPGPU devices, SIGCOMM Comput. Commun. Rev., vol. 40, no. 5.
[11] Y. Zu et al., GPU-based NFA implementation for memory efficient high speed regular expression matching, in Proc. of PPoPP.
[12] X. Yu and M. Becchi, GPU acceleration of regular expression matching for large datasets: exploring the implementation space, in Proc. of CF.
[13] R. Sidhu and V. K. Prasanna, Fast Regular Expression Matching Using FPGAs, in Proc. of FCCM.
[14] M. Becchi and P. Crowley, Efficient regular expression evaluation: theory to practice, in Proc. of ANCS.
[15] Y.-H. E. Yang et al., Compact architecture for high-throughput regular expression matching on FPGA, in Proc. of ANCS.
[16] A. Mitra et al., Compiling PCRE to FPGA for accelerating SNORT IDS, in Proc. of ANCS.
[17] B. C. Brodie et al., A Scalable Architecture For High-Throughput Regular-Expression Pattern Matching, in Proc. of ISCA.
[18] J. Van Lunteren et al., Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator, in Proc. of MICRO.
[19] Y. Fang et al., Fast support for unstructured data processing: the unified automata processor, in Proc. of MICRO.
[20] M. Becchi et al., Evaluating regular expression matching engines on network and general purpose processors, in Proc. of ANCS.
[21] P. Dlugosch et al., An Efficient and Scalable Semiconductor Architecture for Parallel Automata Processing, TPDS, vol. PP, no. 99.
[22] I. Roy and S. Aluru, Finding Motifs in Biological Sequences Using the Micron Automata Processor, in Proc. of IPDPS.
[23] K. Wang et al., Association Rule Mining with the Micron Automata Processor, in Proc. of IPDPS.
[24] K. Zhou et al., Regular expression acceleration on the Micron automata processor: Brill tagging as a case study, in Proc. of Big Data.
[25] I. Roy et al., High Performance Pattern Matching Using the Automata Processor, in Proc. of IPDPS.
[26] I. Roy et al., Algorithmic Techniques for Solving Graph Problems on the Automata Processor, in Proc. of IPDPS.
[27] K. Wang et al., Sequential pattern mining with the Micron automata processor, in Proc. of CF.
[28] J. E. Hopcroft and J. Ullman, Introduction to automata theory, languages, and computation, Addison-Wesley, Reading, Massachusetts.
[29] F. Yu et al., Fast and memory-efficient regular expression matching for deep packet inspection, in Proc. of ANCS.
[30] K. Angstadt et al., RAPID Programming of Pattern-Recognition Processors, in Proc. of ASPLOS.
[31] G. Karypis and V. Kumar, A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, SIAM J. Sci. Comp., vol. 20, no. 1.
[32] J. Wadden et al., ANMLzoo: a benchmark suite for exploring bottlenecks in automata processing engines and architectures, in Proc. of IISWC.
[33] A. Todd et al., Parallel Gene Upstream Comparison via Multi-Level Hash Tables on GPU, in Proc. of ICPADS.
[34] M. Becchi et al., A workload for evaluating deep packet inspection architectures, in Proc. of IISWC 2008.


More information

Generalized Edge Coloring for Channel Assignment in Wireless Networks

Generalized Edge Coloring for Channel Assignment in Wireless Networks Generalize Ege Coloring for Channel Assignment in Wireless Networks Chun-Chen Hsu Institute of Information Science Acaemia Sinica Taipei, Taiwan Da-wei Wang Jan-Jan Wu Institute of Information Science

More information

Overview. Operating Systems I. Simple Memory Management. Simple Memory Management. Multiprocessing w/fixed Partitions.

Overview. Operating Systems I. Simple Memory Management. Simple Memory Management. Multiprocessing w/fixed Partitions. Overview Operating Systems I Management Provie Services processes files Manage Devices processor memory isk Simple Management One process in memory, using it all each program nees I/O rivers until 96 I/O

More information

MODULE VII. Emerging Technologies

MODULE VII. Emerging Technologies MODULE VII Emerging Technologies Computer Networks an Internets -- Moule 7 1 Spring, 2014 Copyright 2014. All rights reserve. Topics Software Define Networking The Internet Of Things Other trens in networking

More information

Non-Uniform Sensor Deployment in Mobile Wireless Sensor Networks

Non-Uniform Sensor Deployment in Mobile Wireless Sensor Networks 01 01 01 01 01 00 01 01 Non-Uniform Sensor Deployment in Mobile Wireless Sensor Networks Mihaela Carei, Yinying Yang, an Jie Wu Department of Computer Science an Engineering Floria Atlantic University

More information

Computer Organization

Computer Organization Computer Organization Douglas Comer Computer Science Department Purue University 250 N. University Street West Lafayette, IN 47907-2066 http://www.cs.purue.eu/people/comer Copyright 2006. All rights reserve.

More information

Transient analysis of wave propagation in 3D soil by using the scaled boundary finite element method

Transient analysis of wave propagation in 3D soil by using the scaled boundary finite element method Southern Cross University epublications@scu 23r Australasian Conference on the Mechanics of Structures an Materials 214 Transient analysis of wave propagation in 3D soil by using the scale bounary finite

More information

SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH

SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH Galen H Sasaki Dept Elec Engg, U Hawaii 2540 Dole Street Honolul HI 96822 USA Ching-Fong Su Fuitsu Laboratories of America 595 Lawrence Expressway

More information

Baring it all to Software: The Raw Machine

Baring it all to Software: The Raw Machine Baring it all to Software: The Raw Machine Elliot Waingol, Michael Taylor, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch, Srikrishna Devabhaktuni, Rajeev Barua, Jonathan Babb,

More information

Questions? Post on piazza, or Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)!

Questions? Post on piazza, or  Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)! EE122 Fall 2013 HW3 Instructions Recor your answers in a file calle hw3.pf. Make sure to write your name an SID at the top of your assignment. For each problem, clearly inicate your final answer, bol an

More information

An Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources

An Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources An Algorithm for Builing an Enterprise Network Topology Using Wiesprea Data Sources Anton Anreev, Iurii Bogoiavlenskii Petrozavosk State University Petrozavosk, Russia {anreev, ybgv}@cs.petrsu.ru Abstract

More information

Table-based division by small integer constants

Table-based division by small integer constants Table-base ivision by small integer constants Florent e Dinechin, Laurent-Stéphane Diier LIP, Université e Lyon (ENS-Lyon/CNRS/INRIA/UCBL) 46, allée Italie, 69364 Lyon Ceex 07 Florent.e.Dinechin@ens-lyon.fr

More information

EDOVE: Energy and Depth Variance-Based Opportunistic Void Avoidance Scheme for Underwater Acoustic Sensor Networks

EDOVE: Energy and Depth Variance-Based Opportunistic Void Avoidance Scheme for Underwater Acoustic Sensor Networks sensors Article EDOVE: Energy an Depth Variance-Base Opportunistic Voi Avoiance Scheme for Unerwater Acoustic Sensor Networks Safar Hussain Bouk 1, *, Sye Hassan Ahme 2, Kyung-Joon Park 1 an Yongsoon Eun

More information

Comparison of Methods for Increasing the Performance of a DUA Computation

Comparison of Methods for Increasing the Performance of a DUA Computation Comparison of Methos for Increasing the Performance of a DUA Computation Michael Behrisch, Daniel Krajzewicz, Peter Wagner an Yun-Pang Wang Institute of Transportation Systems, German Aerospace Center,

More information

Improving Performance of Sparse Matrix-Vector Multiplication

Improving Performance of Sparse Matrix-Vector Multiplication Improving Performance of Sparse Matrix-Vector Multiplication Ali Pınar Michael T. Heath Department of Computer Science an Center of Simulation of Avance Rockets University of Illinois at Urbana-Champaign

More information

Probabilistic Medium Access Control for. Full-Duplex Networks with Half-Duplex Clients

Probabilistic Medium Access Control for. Full-Duplex Networks with Half-Duplex Clients Probabilistic Meium Access Control for 1 Full-Duplex Networks with Half-Duplex Clients arxiv:1608.08729v1 [cs.ni] 31 Aug 2016 Shih-Ying Chen, Ting-Feng Huang, Kate Ching-Ju Lin, Member, IEEE, Y.-W. Peter

More information

Overlap Interval Partition Join

Overlap Interval Partition Join Overlap Interval Partition Join Anton Dignös Department of Computer Science University of Zürich, Switzerlan aignoes@ifi.uzh.ch Michael H. Böhlen Department of Computer Science University of Zürich, Switzerlan

More information

Architecture Design of Mobile Access Coordinated Wireless Sensor Networks

Architecture Design of Mobile Access Coordinated Wireless Sensor Networks Architecture Design of Mobile Access Coorinate Wireless Sensor Networks Mai Abelhakim 1 Leonar E. Lightfoot Jian Ren 1 Tongtong Li 1 1 Department of Electrical & Computer Engineering, Michigan State University,

More information

Indexing the Edges A simple and yet efficient approach to high-dimensional indexing

Indexing the Edges A simple and yet efficient approach to high-dimensional indexing Inexing the Eges A simple an yet efficient approach to high-imensional inexing Beng Chin Ooi Kian-Lee Tan Cui Yu Stephane Bressan Department of Computer Science National University of Singapore 3 Science

More information

EFFICIENT ON-LINE TESTING METHOD FOR A FLOATING-POINT ADDER

EFFICIENT ON-LINE TESTING METHOD FOR A FLOATING-POINT ADDER FFICINT ON-LIN TSTING MTHOD FOR A FLOATING-POINT ADDR A. Droz, M. Lobachev Department of Computer Systems, Oessa State Polytechnic University, Oessa, Ukraine Droz@ukr.net, Lobachev@ukr.net Abstract In

More information

Robust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks

Robust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks Robust PIM-SM Multicasting using Anycast RP in Wireless A Hoc Networks Jaewon Kang, John Sucec, Vikram Kaul, Sunil Samtani an Mariusz A. Fecko Applie Research, Telcoria Technologies One Telcoria Drive,

More information

Coordinating Distributed Algorithms for Feature Extraction Offloading in Multi-Camera Visual Sensor Networks

Coordinating Distributed Algorithms for Feature Extraction Offloading in Multi-Camera Visual Sensor Networks Coorinating Distribute Algorithms for Feature Extraction Offloaing in Multi-Camera Visual Sensor Networks Emil Eriksson, György Dán, Viktoria Foor School of Electrical Engineering, KTH Royal Institute

More information

Recitation Caches and Blocking. 4 March 2019

Recitation Caches and Blocking. 4 March 2019 15-213 Recitation Caches an Blocking 4 March 2019 Agena Reminers Revisiting Cache Lab Caching Review Blocking to reuce cache misses Cache alignment Reminers Due Dates Cache Lab (Thursay 3/7) Miterm Exam

More information

d 3 d 4 d d d d d d d d d d d 1 d d d d d d

d 3 d 4 d d d d d d d d d d d 1 d d d d d d Proceeings of the IASTED International Conference Software Engineering an Applications (SEA') October 6-, 1, Scottsale, Arizona, USA AN OBJECT-ORIENTED APPROACH FOR MANAGING A NETWORK OF DATABASES Shu-Ching

More information

Offloading Cellular Traffic through Opportunistic Communications: Analysis and Optimization

Offloading Cellular Traffic through Opportunistic Communications: Analysis and Optimization 1 Offloaing Cellular Traffic through Opportunistic Communications: Analysis an Optimization Vincenzo Sciancalepore, Domenico Giustiniano, Albert Banchs, Anreea Picu arxiv:1405.3548v1 [cs.ni] 14 May 24

More information

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks 1 Backpressure-base Packet-by-Packet Aaptive Routing in Communication Networks Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, an Alexaner Stolyar Abstract Backpressure-base aaptive routing

More information

Improving Spatial Reuse of IEEE Based Ad Hoc Networks

Improving Spatial Reuse of IEEE Based Ad Hoc Networks mproving Spatial Reuse of EEE 82.11 Base A Hoc Networks Fengji Ye, Su Yi an Biplab Sikar ECSE Department, Rensselaer Polytechnic nstitute Troy, NY 1218 Abstract n this paper, we evaluate an suggest methos

More information

Cloud Search Service Product Introduction. Issue 01 Date HUAWEI TECHNOLOGIES CO., LTD.

Cloud Search Service Product Introduction. Issue 01 Date HUAWEI TECHNOLOGIES CO., LTD. 1.3.15 Issue 01 Date 2018-11-21 HUAWEI TECHNOLOGIES CO., LTD. Copyright Huawei Technologies Co., Lt. 2019. All rights reserve. No part of this ocument may be reprouce or transmitte in any form or by any

More information

Distributed Line Graphs: A Universal Technique for Designing DHTs Based on Arbitrary Regular Graphs

Distributed Line Graphs: A Universal Technique for Designing DHTs Based on Arbitrary Regular Graphs IEEE TRANSACTIONS ON KNOWLEDE AND DATA ENINEERIN, MANUSCRIPT ID Distribute Line raphs: A Universal Technique for Designing DHTs Base on Arbitrary Regular raphs Yiming Zhang an Ling Liu, Senior Member,

More information

Control of Scalable Wet SMA Actuator Arrays

Control of Scalable Wet SMA Actuator Arrays Proceeings of the 2005 IEEE International Conference on Robotics an Automation Barcelona, Spain, April 2005 Control of Scalable Wet SMA Actuator Arrays eslie Flemming orth Dakota State University Mechanical

More information

Frequent Pattern Mining. Frequent Item Set Mining. Overview. Frequent Item Set Mining: Motivation. Frequent Pattern Mining comprises

Frequent Pattern Mining. Frequent Item Set Mining. Overview. Frequent Item Set Mining: Motivation. Frequent Pattern Mining comprises verview Frequent Pattern Mining comprises Frequent Pattern Mining hristian Borgelt School of omputer Science University of Konstanz Universitätsstraße, Konstanz, Germany christian.borgelt@uni-konstanz.e

More information

Chapter 5 Proposed models for reconstituting/ adapting three stereoscopes

Chapter 5 Proposed models for reconstituting/ adapting three stereoscopes Chapter 5 Propose moels for reconstituting/ aapting three stereoscopes - 89 - 5. Propose moels for reconstituting/aapting three stereoscopes This chapter offers three contributions in the Stereoscopy area,

More information

A Neural Network Model Based on Graph Matching and Annealing :Application to Hand-Written Digits Recognition

A Neural Network Model Based on Graph Matching and Annealing :Application to Hand-Written Digits Recognition ITERATIOAL JOURAL OF MATHEMATICS AD COMPUTERS I SIMULATIO A eural etwork Moel Base on Graph Matching an Annealing :Application to Han-Written Digits Recognition Kyunghee Lee Abstract We present a neural

More information

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots Aaptive Loa Balancing base on IP Fast Reroute to Avoi Congestion Hot-spots Masaki Hara an Takuya Yoshihiro Faculty of Systems Engineering, Wakayama University 930 Sakaeani, Wakayama, 640-8510, Japan Email:

More information

k-nn Graph Construction: a Generic Online Approach

k-nn Graph Construction: a Generic Online Approach k-nn Graph Construction: a Generic Online Approach Wan-Lei Zhao arxiv:80.00v [cs.ir] Sep 08 Abstract Nearest neighbor search an k-nearest neighbor graph construction are two funamental issues arise from

More information

Implementation and Evaluation of NAS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA

Implementation and Evaluation of NAS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA Implementation an Evaluation of AS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA Kazuya Matsumoto 1, orihisa Fujita 2, Toshihiro Hanawa 3, an Taisuke Boku 1,2 1 Center for Computational

More information

Impact of FTP Application file size and TCP Variants on MANET Protocols Performance

Impact of FTP Application file size and TCP Variants on MANET Protocols Performance International Journal of Moern Communication Technologies & Research (IJMCTR) Impact of FTP Application file size an TCP Variants on MANET Protocols Performance Abelmuti Ahme Abbasher Ali, Dr.Amin Babkir

More information

On Effectively Determining the Downlink-to-uplink Sub-frame Width Ratio for Mobile WiMAX Networks Using Spline Extrapolation

On Effectively Determining the Downlink-to-uplink Sub-frame Width Ratio for Mobile WiMAX Networks Using Spline Extrapolation On Effectively Determining the Downlink-to-uplink Sub-frame With Ratio for Mobile WiMAX Networks Using Spline Extrapolation Panagiotis Sarigianniis, Member, IEEE, Member Malamati Louta, Member, IEEE, Member

More information

On the Placement of Internet Taps in Wireless Neighborhood Networks

On the Placement of Internet Taps in Wireless Neighborhood Networks 1 On the Placement of Internet Taps in Wireless Neighborhoo Networks Lili Qiu, Ranveer Chanra, Kamal Jain, Mohamma Mahian Abstract Recently there has emerge a novel application of wireless technology that

More information

Enabling Rollback Support in IT Change Management Systems

Enabling Rollback Support in IT Change Management Systems Enabling Rollback Support in IT Change Management Systems Guilherme Sperb Machao, Fábio Fabian Daitx, Weverton Luis a Costa Coreiro, Cristiano Bonato Both, Luciano Paschoal Gaspary, Lisanro Zambeneetti

More information

Multilevel Linear Dimensionality Reduction using Hypergraphs for Data Analysis

Multilevel Linear Dimensionality Reduction using Hypergraphs for Data Analysis Multilevel Linear Dimensionality Reuction using Hypergraphs for Data Analysis Haw-ren Fang Department of Computer Science an Engineering University of Minnesota; Minneapolis, MN 55455 hrfang@csumneu ABSTRACT

More information

Adjacency Matrix Based Full-Text Indexing Models

Adjacency Matrix Based Full-Text Indexing Models 1000-9825/2002/13(10)1933-10 2002 Journal of Software Vol.13, No.10 Ajacency Matrix Base Full-Text Inexing Moels ZHOU Shui-geng 1, HU Yun-fa 2, GUAN Ji-hong 3 1 (Department of Computer Science an Engineering,

More information

Research Article REALFLOW: Reliable Real-Time Flooding-Based Routing Protocol for Industrial Wireless Sensor Networks

Research Article REALFLOW: Reliable Real-Time Flooding-Based Routing Protocol for Industrial Wireless Sensor Networks Hinawi Publishing Corporation International Journal of Distribute Sensor Networks Volume 2014, Article ID 936379, 17 pages http://x.oi.org/10.1155/2014/936379 Research Article REALFLOW: Reliable Real-Time

More information

Considering bounds for approximation of 2 M to 3 N

Considering bounds for approximation of 2 M to 3 N Consiering bouns for approximation of to (version. Abstract: Estimating bouns of best approximations of to is iscusse. In the first part I evelop a powerseries, which shoul give practicable limits for

More information

Fast Fractal Image Compression using PSO Based Optimization Techniques

Fast Fractal Image Compression using PSO Based Optimization Techniques Fast Fractal Compression using PSO Base Optimization Techniques A.Krishnamoorthy Visiting faculty Department Of ECE University College of Engineering panruti rishpci89@gmail.com S.Buvaneswari Visiting

More information

PAPER. 1. Introduction

PAPER. 1. Introduction IEICE TRANS. COMMUN., VOL. E9x-B, No.8 AUGUST 2010 PAPER Integrating Overlay Protocols for Proviing Autonomic Services in Mobile A-hoc Networks Panagiotis Gouvas, IEICE Stuent member, Anastasios Zafeiropoulos,,

More information

Learning Subproblem Complexities in Distributed Branch and Bound

Learning Subproblem Complexities in Distributed Branch and Bound Learning Subproblem Complexities in Distribute Branch an Boun Lars Otten Department of Computer Science University of California, Irvine lotten@ics.uci.eu Rina Dechter Department of Computer Science University

More information

Finite Automata Implementations Considering CPU Cache J. Holub

Finite Automata Implementations Considering CPU Cache J. Holub Finite Automata Implementations Consiering CPU Cache J. Holub The finite automata are mathematical moels for finite state systems. More general finite automaton is the noneterministic finite automaton

More information

Generalized Edge Coloring for Channel Assignment in Wireless Networks

Generalized Edge Coloring for Channel Assignment in Wireless Networks TR-IIS-05-021 Generalize Ege Coloring for Channel Assignment in Wireless Networks Chun-Chen Hsu, Pangfeng Liu, Da-Wei Wang, Jan-Jan Wu December 2005 Technical Report No. TR-IIS-05-021 http://www.iis.sinica.eu.tw/lib/techreport/tr2005/tr05.html

More information

On the Role of Multiply Sectioned Bayesian Networks to Cooperative Multiagent Systems

On the Role of Multiply Sectioned Bayesian Networks to Cooperative Multiagent Systems On the Role of Multiply Sectione Bayesian Networks to Cooperative Multiagent Systems Y. Xiang University of Guelph, Canaa, yxiang@cis.uoguelph.ca V. Lesser University of Massachusetts at Amherst, USA,

More information

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks 1 Backpressure-base Packet-by-Packet Aaptive Routing in Communication Networks Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, an Alexaner Stoylar arxiv:15.4984v1 [cs.ni] 27 May 21 Abstract

More information

Throughput Characterization of Node-based Scheduling in Multihop Wireless Networks: A Novel Application of the Gallai-Edmonds Structure Theorem

Throughput Characterization of Node-based Scheduling in Multihop Wireless Networks: A Novel Application of the Gallai-Edmonds Structure Theorem Throughput Characterization of Noe-base Scheuling in Multihop Wireless Networks: A Novel Application of the Gallai-Emons Structure Theorem Bo Ji an Yu Sang Dept. of Computer an Information Sciences Temple

More information

Top-down Connectivity Policy Framework for Mobile Peer-to-Peer Applications

Top-down Connectivity Policy Framework for Mobile Peer-to-Peer Applications Top-own Connectivity Policy Framework for Mobile Peer-to-Peer Applications Otso Kassinen Mika Ylianttila Junzhao Sun Jussi Ala-Kurikka MeiaTeam Department of Electrical an Information Engineering University

More information

Non-Uniform Sensor Deployment in Mobile Wireless Sensor Networks

Non-Uniform Sensor Deployment in Mobile Wireless Sensor Networks 0 0 0 0 0 0 0 0 on-uniform Sensor Deployment in Mobile Wireless Sensor etworks Mihaela Carei, Yinying Yang, an Jie Wu Department of Computer Science an Engineering Floria Atlantic University Boca Raton,

More information

Classifying Facial Expression with Radial Basis Function Networks, using Gradient Descent and K-means

Classifying Facial Expression with Radial Basis Function Networks, using Gradient Descent and K-means Classifying Facial Expression with Raial Basis Function Networks, using Graient Descent an K-means Neil Allrin Department of Computer Science University of California, San Diego La Jolla, CA 9237 nallrin@cs.ucs.eu

More information

Automation Framework for Large-Scale Regular Expression Matching on FPGA. Thilan Ganegedara, Yi-Hua E. Yang, Viktor K. Prasanna

Automation Framework for Large-Scale Regular Expression Matching on FPGA. Thilan Ganegedara, Yi-Hua E. Yang, Viktor K. Prasanna Automation Framework for Large-Scale Regular Expression Matching on FPGA Thilan Ganegedara, Yi-Hua E. Yang, Viktor K. Prasanna Ming-Hsieh Department of Electrical Engineering University of Southern California

More information

Divide-and-Conquer Algorithms

Divide-and-Conquer Algorithms Supplment to A Practical Guie to Data Structures an Algorithms Using Java Divie-an-Conquer Algorithms Sally A Golman an Kenneth J Golman Hanout Divie-an-conquer algorithms use the following three phases:

More information

Disjoint Multipath Routing in Dual Homing Networks using Colored Trees

Disjoint Multipath Routing in Dual Homing Networks using Colored Trees Disjoint Multipath Routing in Dual Homing Networks using Colore Trees Preetha Thulasiraman, Srinivasan Ramasubramanian, an Marwan Krunz Department of Electrical an Computer Engineering University of Arizona,

More information

Two Dimensional-IP Routing

Two Dimensional-IP Routing Two Dimensional-IP Routing Mingwei Xu Shu Yang Dan Wang Hong Kong Polytechnic University Jianping Wu Abstract Traitional IP networks use single-path routing, an make forwaring ecisions base on estination

More information

Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations

Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations 2012 Thir International Conference on Networking an Computing Towars a Low-Power Accelerator of Many FPGAs for Stencil Computations Ryohei Kobayashi Tokyo Institute of Technology, Japan E-mail: kobayashi@arch.cs.titech.ac.jp

More information

Queueing Model and Optimization of Packet Dropping in Real-Time Wireless Sensor Networks

Queueing Model and Optimization of Packet Dropping in Real-Time Wireless Sensor Networks Queueing Moel an Optimization of Packet Dropping in Real-Time Wireless Sensor Networks Marc Aoun, Antonios Argyriou, Philips Research, Einhoven, 66AE, The Netherlans Department of Computer an Communication

More information

Overview : Computer Networking. IEEE MAC Protocol: CSMA/CA Internet mobility TCP over noisy links

Overview : Computer Networking. IEEE MAC Protocol: CSMA/CA Internet mobility TCP over noisy links Overview 15-441 15-441: Computer Networking 15-641 Lecture 24: Wireless Eric Anerson Fall 2014 www.cs.cmu.eu/~prs/15-441-f14 Internet mobility TCP over noisy links Link layer challenges an WiFi Cellular

More information

Particle Swarm Optimization Based on Smoothing Approach for Solving a Class of Bi-Level Multiobjective Programming Problem

Particle Swarm Optimization Based on Smoothing Approach for Solving a Class of Bi-Level Multiobjective Programming Problem BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 3 Sofia 017 Print ISSN: 1311-970; Online ISSN: 1314-4081 DOI: 10.1515/cait-017-0030 Particle Swarm Optimization Base

More information

Local Path Planning with Proximity Sensing for Robot Arm Manipulators. 1. Introduction

Local Path Planning with Proximity Sensing for Robot Arm Manipulators. 1. Introduction Local Path Planning with Proximity Sensing for Robot Arm Manipulators Ewar Cheung an Vlaimir Lumelsky Yale University, Center for Systems Science Department of Electrical Engineering New Haven, Connecticut

More information

Provisioning Virtualized Cloud Services in IP/MPLS-over-EON Networks

Provisioning Virtualized Cloud Services in IP/MPLS-over-EON Networks Provisioning Virtualize Clou Services in IP/MPLS-over-EON Networks Pan Yi an Byrav Ramamurthy Department of Computer Science an Engineering, University of Nebraska-Lincoln Lincoln, Nebraska 68588 USA Email:

More information

Politehnica University of Timisoara Mobile Computing, Sensors Network and Embedded Systems Laboratory. Testing Techniques

Politehnica University of Timisoara Mobile Computing, Sensors Network and Embedded Systems Laboratory. Testing Techniques Politehnica University of Timisoara Mobile Computing, Sensors Network an Embee Systems Laboratory ing Techniques What is testing? ing is the process of emonstrating that errors are not present. The purpose

More information

NAND flash memory is widely used as a storage

NAND flash memory is widely used as a storage 1 : Buffer-Aware Garbage Collection for Flash-Base Storage Systems Sungjin Lee, Dongkun Shin Member, IEEE, an Jihong Kim Member, IEEE Abstract NAND flash-base storage evice is becoming a viable storage

More information

DATA PARALLEL FPGA WORKLOADS: SOFTWARE VERSUS HARDWARE. Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose

DATA PARALLEL FPGA WORKLOADS: SOFTWARE VERSUS HARDWARE. Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose DATA PARALLEL FPGA WORKLOADS: SOFTWARE VERSUS HARDWARE Peter Yiannacouras, J. Gregory Steffan, an Jonathan Rose Ewar S. Rogers Sr. Department of Electrical an Computer Engineering University of Toronto

More information

Wireless Sensing and Structural Control Strategies

Wireless Sensing and Structural Control Strategies Wireless Sensing an Structural Control Strategies Kincho H. Law 1, Anrew Swartz 2, Jerome P. Lynch 3, Yang Wang 4 1 Dept. of Civil an Env. Engineering, Stanfor University, Stanfor, CA 94305, USA 2 Dept.

More information

Shift-map Image Registration

Shift-map Image Registration Shift-map Image Registration Svärm, Linus; Stranmark, Petter Unpublishe: 2010-01-01 Link to publication Citation for publishe version (APA): Svärm, L., & Stranmark, P. (2010). Shift-map Image Registration.

More information

Pairwise alignment using shortest path algorithms, Gunnar Klau, November 29, 2005, 11:

Pairwise alignment using shortest path algorithms, Gunnar Klau, November 29, 2005, 11: airwise alignment using shortest path algorithms, Gunnar Klau, November 9,, : 3 3 airwise alignment using shortest path algorithms e will iscuss: it graph Dijkstra s algorithm algorithm (GDU) 3. References

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpenCourseWare http://ocw.mit.eu 6.854J / 18.415J Avance Algorithms Fall 2008 For inormation about citing these materials or our Terms o Use, visit: http://ocw.mit.eu/terms. 18.415/6.854 Avance Algorithms

More information

Questions? Post on piazza, or Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)!

Questions? Post on piazza, or  Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)! EE122 Fall 2013 HW3 Instructions Recor your answers in a file calle hw3.pf. Make sure to write your name an SID at the top of your assignment. For each problem, clearly inicate your final answer, bol an

More information

On-path Cloudlet Pricing for Low Latency Application Provisioning

On-path Cloudlet Pricing for Low Latency Application Provisioning On-path Cloulet Pricing for Low Latency Application Provisioning Argyrios G. Tasiopoulos, Onur Ascigil, Ioannis Psaras, Stavros Toumpis, George Pavlou Dept. of Electronic an Electrical Engineering, University

More information

Comparison of Wireless Network Simulators with Multihop Wireless Network Testbed in Corridor Environment

Comparison of Wireless Network Simulators with Multihop Wireless Network Testbed in Corridor Environment Comparison of Wireless Network Simulators with Multihop Wireless Network Testbe in Corrior Environment Rabiullah Khattak, Anna Chaltseva, Laurynas Riliskis, Ulf Boin, an Evgeny Osipov Department of Computer

More information

Coupon Recalculation for the GPS Authentication Scheme

Coupon Recalculation for the GPS Authentication Scheme Coupon Recalculation for the GPS Authentication Scheme Georg Hofferek an Johannes Wolkerstorfer Graz University of Technology, Institute for Applie Information Processing an Communications (IAIK), Inffelgasse

More information

Threshold Based Data Aggregation Algorithm To Detect Rainfall Induced Landslides

Threshold Based Data Aggregation Algorithm To Detect Rainfall Induced Landslides Threshol Base Data Aggregation Algorithm To Detect Rainfall Inuce Lanslies Maneesha V. Ramesh P. V. Ushakumari Department of Computer Science Department of Mathematics Amrita School of Engineering Amrita

More information

Using Vector and Raster-Based Techniques in Categorical Map Generalization

Using Vector and Raster-Based Techniques in Categorical Map Generalization Thir ICA Workshop on Progress in Automate Map Generalization, Ottawa, 12-14 August 1999 1 Using Vector an Raster-Base Techniques in Categorical Map Generalization Beat Peter an Robert Weibel Department

More information

Lab work #8. Congestion control

Lab work #8. Congestion control TEORÍA DE REDES DE TELECOMUNICACIONES Grao en Ingeniería Telemática Grao en Ingeniería en Sistemas e Telecomunicación Curso 2015-2016 Lab work #8. Congestion control (1 session) Author: Pablo Pavón Mariño

More information

Principles of B-trees

Principles of B-trees CSE465, Fall 2009 February 25 1 Principles of B-trees Noes an binary search Anoe u has size size(u), keys k 1,..., k size(u) 1 chilren c 1,...,c size(u). Binary search property: for i = 1,..., size(u)

More information