Politecnico di Torino. Porto Institutional Repository

Size: px

Start display at page:

Download "Politecnico di Torino. Porto Institutional Repository"

Gavin Ellis
5 years ago
Views:

Politecnico i Torino Porto Institutional Repositor [Article] Scalale Algorithms for NFA Multi-Striing an NFA-Base Deep Packet Inspection on GPUs Original Citation: M. Avalle; F. Risso; R. Sisto (6).

- ISSN 63-669 Availailit: This version is availale at : http://porto.polito.it/656

1 Politecnico i Torino Porto Institutional Repositor [Article] Scalale Algorithms for NFA Multi-Striing an NFA-Base Deep Packet Inspection on GPUs Original Citation: M. Avalle; F. Risso; R. Sisto (6). Scalale Algorithms for NFA Multi-Striing an NFA-Base Deep Packet Inspection on GPUs. In: IEEE-ACM TRANSACTIONS ON NETWORKING, vol. n. 3, pp ISSN Availailit: This version is availale at : since: Octoer 6 Pulisher: IEEE Pulishe version: DOI:.9/TNET Terms of use: This article is mae availale uner terms an conitions applicale to Open Access Polic Article ("Pulic - All rights reserve"), as escrie at html Porto, the institutional repositor of the Politecnico i Torino, is provie the Universit Lirar an the IT-Services. The aim is to enale open access to all the worl. Please share with us how this access enefits ou. Your stor matters. (Article egins on next page)

2 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 Scalale Algorithms for NFA Multi-striing an NFA-ase Deep Packet Inspection on GPUs Matteo Avalle, Fulvio Risso, Memer, IEEE, an Riccaro Sisto, Senior Memer, ACM Astract Finite State Automata (FSA) are use man network processing applications to match complex sets of regular expressions in network packets. In orer to make FSA-ase matching possile even at the ever increasing spee of moern networks, multi-striing has een introuce. This technique increases input parallelism transforming the classical FSA that consumes input te--te into an equivalent one that consumes input in larger units. However, the algorithms use toa for this transformation are so complex that the often result unfeasile for large an complex rule sets. This paper presents a set of new algorithms that exten the applicailit of multistriing to complex rule sets. These algorithms can transform Noneterministic Finite Automata (NFA) into their multi-strie form with reuce memor an time requirements. Moreover, the exploit the massive parallelism of Graphical Processing Units for NFA-ase matching. The final result is a oost of the overall processing spee on tpical regex-ase packet processing applications, with a speeup of almost one orer of magnitue compare to the current state of the art algorithms. I. INTRODUCTION Regular expression (regex) pattern matching is currentl one of the most wiel use techniques to inspect network traffic, also ecause of the possiilit to e transforme into equivalent Finite State Automata (FSA), which provies a soli mathematical founation an enales the use of wellknown algorithms. For instance, simple algorithms exist to etermine if a particular input matches a regular expression an for composing multiple FSAs together, such as when multiple rules (e.g., the several thousans patterns use a NIDS) have to e checke within a single scan of the input. FSAs come in two ifferent forms with equivalent expressiveness, namel Deterministic FSA (DFA) an Noneterministic FSA (NFA). A DFA guarantees a oune execution time for the matching operation on an execution architecture, ecause for each input string a single path in the automaton has to e followe. However, the oune execution time is achieve at the expense of a potentiall huge memor consumption, particularl when complex regular expressions (e.g., with frequent use of repetition wilcars.* ) are use, or when man expressions are comine together in a single DFA. While this prolem can e alleviate using variations of the DFA, important limitations still exist on the numer an the complexit of the rule sets that can e hanle using this form. This ma force application evelopers to use approximate (an simpler) regular expressions, which ma either impair the capailit to filter out exactl the esire Dipartimento i Automatica e Informatica, Politecnico i Torino, Ital {matteo.avalle, fulvio.risso, riccaro.sisto}@polito.it Manuscript receive XXX; revise XXX. traffic or potentiall generate man false positives, which ma e particularl critical in securit applications. The NFA form solves the previous prolem ecause the numer of states in the automaton is irectl proportional to the length of the regular expressions use to create the FSA. However, this ma lea to a non-preictale execution time, which ma grow ramaticall when strictl sequential matching algorithms are use. In fact, with a NFA, the matching of a given input ma require to follow a numer of paths in the automaton which is not preictale an epens on the complexit of the automaton an on the particular input ata that are eing processe. For this reason, software-ase implementations with strict processing time constraints (such as packet processing applications running on general purpose CPUs) are usuall ase on DFA or on some of its variations. Recentl, we propose infant [], a new software solution for regex matching ase on NFA that exploits the high parallelism of Graphical Processing Units (GPU) in orer to reuce the variailit of NFA processing times an make this approach usale in practice. This paper extens that work exploring the possiilit to improve the throughput means of multi-striing, i.e. the transformation of the automaton into one that processes more than one te of input at each step. Using this technique, if a DFA takes N steps to perform a matching, where N is the length of the input ata (e.g., the size of a network packet), a version of the DFA returns the same result in N/ steps. With a NFA, similar reuctions can e achieve, hence potentiall ouling the throughput at each strie ouling. Figure presents an example of a simple regular expression with its corresponing stanar an NFA. In the figure, the states with the ashe orer are the initial states while those with thick orers are the final (i.e. accepting) states. The notation 55 stans for an smol in the range from to 55. This means that the transition with that lael is in fact a set of transitions, one for each possile input smol. In the NFA each transition consumes two input smols an correspons to the comination of two ajacent transitions of the original NFA. For instance, in the original automaton the input string ac triggers two transitions: a transition from state q to state q that consumes a, followe a transition to state q that consumes c. In the automaton, instea, a single transition leas irectl from q to q an consumes the pair of smols ac. Although, in theor, multi-striing can e applie grouping together an aritrar numer of input smols, in practice the use of too man smols leas to so complex automata that the uiling process cannot e complete in reasonale time (an memor). Therefore, multi-striing is usuall im-

3 on Group Group Threa Block (3 threas) IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 (a) Regexp () NFA (c) -Strie NFA a,c, 55 a*c a c a, q q q q q 55,a q,c c, q3 55 q3 55, 55, Fig.. A simple regular expression in its (a) textual representation, () NFA form an (c) NFA form. plemente iterativel ouling the strie level until further ouling ecomes unfeasile, i.e. first the NFA is transforme into a NFA, then the latter is transforme into a - strie NFA, an so on. As shown intuitivel in Figure, at each strie ouling the numer of states oes not change, ut the numer of transitions ma increase quaraticall in the worst case. For this reason, multi-striing is usuall couple with Alphaet Compression [], [3], which reuces the numer of transitions performing a compression of its input alphaet. Unfortunatel, current state-of-the-art algorithms for multi-striing an alphaet compression show poor scalailit properties, hence supporting onl ver simple automata. Hence the are not compatile with the complex FSA that result from real-life rule sets for network packet inspection. This paper aresses the aove prolem in two irections. First, it enhances in various was the current multi-striing an alphaet compression techniques in orer to cope with the complexit of the tpical rule sets use in real applications. Secon, it shows how the GPU-ase algorithm of infant [] can e aapte to efficientl execute the resulting multi-strie automata. The final result is the possiilit to match complex rule sets in real time on high-spee networks, which was not possile with the previous software-ase techniques. This paper is structure as follows. Section II presents the state of the art, Section III provies ackgroun information aout infant, multi-striing, an alphaet compression, while Sections IV to VII escrie the new algorithms. Section VIII presents enchmark results of the new algorithms while, finall, Section IX raws conclusions. II. RELATED WORK Although the iea of processing more than one character at a time appeare in various previous papers, an algorithm for computing multi-strie DFAs strie ouling was escrie for the first time in []. The same paper also propose to comine multi-striing with alphaet compression, in orer to reuce the quaratic increase of the numer of transitions that occurs at each strie ouling step. Some improvements of these techniques, in terms of computational complexit, have een presente in [3] (later extene in []) an in [5] that provie, respectivel, a new algorithm for alphaet compression an a new strie ouling algorithm, oth with reuce computational complexities compare with the previous ones. [5] also presente the extension of these algorithms to NFA. Another algorithm for NFA strie ouling with similar complexit ut without alphaet compression was presente in [6], while in [7] we showe our preliminar results towar more scalale algorithms for multi-striing. An alternative NFA representation ase on Orere Binar Decision Diagrams (NFA-OBDD) is propose in [8]. This approach is shown to outperform PCRE, a popular tool for regex matching ase on NFA algorithms, when ealing with large rule sets. However, the throughput reporte using the propose approach is quite moest if compare to the one of [], which exploits the parallelism of GPUs in orer to counter the increase an variailit of execution times. [8] also explores multi-striing with alphaet compression appling the ol algorithms propose [] on the NFA efore generating the OBDD representation. Of course, the introuction of multi-striing gives some improvement, ut the scalailit prolems of the multi-striing an alphaet compression techniques propose in [] are confirme here, as the cannot excee the x strie level. A common limitation of all the algorithms presente so far for computing the multi-strie versions of automata is their high CPU an memor consumption. As a consequence, when ealing with the complex rule sets foun in real-life network applications, the multi-strie automata cannot e compute at all, or onl ver limite strie levels can e achieve with reasonale resources, as shown in [8]. These limitations are overcome our improvements on the state of the art algorithms. Particularl, the alphaet compression algorithm has een reesigne to use multiple (ut smaller) conversion tales at the same time, rather than just a single one, a technique that was originall propose in [9] for stanar (non multi-strie) automata. In aition, we also propose a more effective heuristic to istriute smols across the multiple tales. Other papers propose ver efficient techniques for regular expression matching that are either not aequate for software implementation or not applicale to the general prolem of regex matching with complex patterns. For instance, [] proposes also a run-length encoing for automata that relies on priorit encoers an TCAMs while [] proposes another solution ase on TCAMs, oth est suite to implementations ase on ASICs or FPGAs. Another example is the varialestrie technique presente in [] which applies onl to string matching an oes not support regular expressions. Along this line, several papers [] [3] [] [5] propose fast GPUoriente algorithms that operate on DFA-ase representations. In orer to eal with complex regex rule sets, the patterns that generate an excessive amount of states are kept in the NFA form an execute on the main CPU ut this leas to poor performance. Recentl, [6] an [7] propose new regex matching algorithms that are specificall esigne for running NFA on GPUs an that improve the solution of infant [], mainl removing the useless processing of the transitions that start from inactive states. Our optimize algorithms for creating multi-strie automata provie a contriution that has more general valiit, i.e. the multi-strie NFA create our algorithms coul e exploite, in principle, an NFA-ase regex processors, incluing the ones propose in [6] an [7], after proper aaptation. In particular, aaptation is necessar if multimap alphaet compression (Section VI) is use. Then, the aove papers can represent a vali alternative to our GPU-

4 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 3 ase run-time regex engine presente in Section VII (ase on []), ut further work is necessar to aapt them to support our multimap algorithms. III. NOTATION AND BACKGROUND Tale I lists the main smols use in this paper along with a short escription of their meaning. Q δ Σ A X N fo L s R A. NFA TABLE I COMMON SYMBOLS AND NOTATION USED IN THIS PAPER Set of states of an NFA Set of transitions of an NFA Set of smols of an NFA (alphaet) Set of accepting states of an NFA Numer of elements of set X (carinalit) Average numer of states of an NFA that can e reache starting from a single state, following all its outgoing transitions (average fan-out) Average numer of transitions of an NFA connecting the same pair of states (average lael size) Average numer of transition ranges of an NFA connecting the same pair of states (similar to L s ut it counts ranges instea of single transitions) A NFA can e formalize as a 5-tuple Q, Σ, q, δ, A where Q is the set of states, Σ is the set of smols (the alphaet), q Q is the initial state, δ Q Σ Q is the transition function, i.e. a set of transitions, where each transition is a triple (q, s, q ) with initial state q Q, lael s Σ an final state q Q, an A Q is the set of accepting states. A NFA takes as input a sequence of smols s,..., s n, with s i Σ. A match is foun at smol s k if there exists a sequence of k + states q,..., q k starting with the initial state an terminating with an accepting state q k, an there exists a sequence of transitions laele s,..., s k that in each state of the sequence to the next state, i.e. i [, k] : (q i, s i, q i ) δ. The classical sequential NFA matching algorithm keeps track of all the states that can e reache after each input smol an reports a match when an accepting state is reache. The algorithm is ase on keeping a set of active states that, at the eginning, inclues onl the initial state. Input smols are rea sequentiall an for each input smol the next set of active states is compute following all the enale transitions, i.e., the ones that fire ase on the given input smol an that start from an of the current active states. When the NFA represents a set R of regular expressions, it is possile to a the information aout which regular expressions are matche in each acceptance state means of a function rules : A R that maps each acceptance state onto the set of matche regular expressions. B. Multi-striing Algorithm shows a simplifie version of the state-of-the art strie ouling algorithm presente in [5] which uils the new set of transitions δ enumerating, for each reachale Algorithm The strie ouling algorithm of [5]. : δ = {} : queue = {q }; processe = {q } 3: while!queue.empt() o : q = queue.pop() 5: for all s A Σ, q Q (q, s A, q ) δ o 6: for all s B Σ, q Q (q, s B, q ) δ o 7: δ = δ {(q, (s A, s B ), q )} 8: if q / processe then 9: queue.push(q ) : processe = processe {q } : en if : en for 3: en for : en while state q, all the possile cominations of two consecutive transitions (lines -6). For each of those cominations, the algorithm generates a transition in the new automaton that is equivalent to the two original transitions. For example, the new transition that rings irectl from q to q will e laele with the concatenation of the laels associate with the two original transitions (line 7). The algorithm starts processing the initial state (line ) an then it iterates through all the states that can e reache using the generate compoun transitions (lines 8-). For simplicit, the presente pseuo-coe oes not inclue the particular case where the intermeiate state q is an accepting state while q is not. This case is hanle in [5] also keeping the information aout the rules matche in each acceptance state: for each rule r matche in q, an extra transition is ae leaing to a (possil extra) accepting state q a with no outgoing transitions such that rules(q a ) = {r}. The asmptotic time complexit of the algorithm can e evaluate as Q (N fo L s ) where the meaning of N fo an L s is explaine in Tale I. This formula erives from the oservation that, starting from each of the Q states of the NFA, it is necessar to iterate through all its N fo L s outgoing transitions an, then, for each reache state, N fo L s compoun transitions are ae. C. Alphaet compression This step reuces the exponential growth of the alphaet size Σ with the strie level (the size ecomes Σ k for the k-strie NFA). This size is irectl proportional to the L s factor of the strie ouling asmptotic complexit formula. Hence, its growth impacts not onl on the complexit of the resulting NFA ut also on the possiilit for the strie ouling algorithm to terminate in a reasonale time. As a consequence, alphaet compression is execute after each strie ouling step, in orer to ecrease the carinalit of the new alphaet. The main iea ehin an alphaet compression algorithm is that often there are smols that are equivalent, i.e. the alwas trigger exactl the same set of transitions. This happens ecause often the patterns use to scan the network traffic are limite to alphanumeric characters (e.g., -9A-Za-z ) an to a few other smols, while the rest are ignore. If two (or more) input smols (e.g., aa an ) alwas originate the same transitions, the are replace with a single one (e.g., α ), thus ecreasing the numer of input smols in the alphaet.

5 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 q q q a f,a f e z,e z 'a' 'e' 'f' 'z' 'a' 'e' 'f' 'z' q->q q->q q->q Fig.. An example of alphaet compression. q->q In essence, this process enales each new smol of the new alphaet to represent an entire class of smols of the original alphaet; as a consequence, the translate NFA that uses this ictionar has a smaller numer of transitions than the original one although the two NFA are functionall equivalent. Oviousl, as alphaet compression changes the alphaet of the NFA, each input string (i.e., each packet) has to e translate to the new alphaet sustituting pairs of consecutive smols with the new smols assigne to their equivalence classes. However, as the time require to perform this operation is generall lower than the time require for matching, the overhea of this ata translation can e consiere negligile an it can e pipeline with the regular expression matching task. Figure shows the equivalence classes uilt for a simple NFA. The map on the right han sie is a graphical representation of the space of smol pairs to e partitione into equivalence classes (each cell in the map represents a single smol pair). Smol pairs are partitione into equivalence classes accoring to the transitions the can trigger. For example, the equivalence class of smol pairs that can trigger onl a transition from q to q (the area laele with q q ) is mae up of the smol pairs a f,a f with the exclusion of e f,e f. In fact, the pairs e f,e f trigger two transitions, from q to q an from q to q, so that the constitute another equivalence class. The total numer of classes is : the other classes are mae up of the smol pairs that can trigger onl a transition from q to q, an those that can trigger no transition (the area of the map not covere rectangles). Consequentl, the new alphaet is mae up of onl smols, each one assigne to one equivalence class, an the translation ictionar replaces all the smols of each equivalence class with the new smol assigne to that class. Determining the equivalence classes uiling the sets of transitions triggere each smol pair is unfeasile with large automata, mainl ecause of the huge amount of memor require: the size woul e O( Σ Q ), as for each cell of the map a set of state pairs shoul e store. The algorithm propose in [] can perform the same operation using just one integer an one oolean for each smol pair, ut its time complexit is O( Σ Q ). The algorithm that can e consiere the current state of the art [] traes a slight increase in memor consumption (it requires integers an ooleans per smol pair) for a noticeale improvement in time efficienc. When comining this algorithm with strie ouling, [] proposes to perform a preliminar compression step, followe one or more strie-ouling steps, each one followe a compression step. Algorithm shows the alphaet compression algorithm propose in [], aapte to the case of an input NFA having an alphaet mae of pairs of smols (the alphaet is Σ Σ), like the one otaine from a strie ouling step. With this algorithm the translation ictionar (an arra of integers calle map) is uilt in an iterative wa that the authors call cluster ivision: initiall all the elements of the ictionar are fille with the same value (), meaning that all the possile smol pairs are translate to the same equivalence class, an then it iterativel ivies the classes consiering separatel each possile comination of states (q, q ) Q Q (lines -3). The main iea of each ivision step is that a smol pair a has to e remappe to a new class if from q to q there is no transition laele with it ut there are transitions laele with other smol pairs that previousl elonge to the same class as a. The algorithm uses the two arras of ooleans name char an class to recor respectivel the set of smol pairs that lael the transitions from q to q an the classes covere these smol pairs. A first iteration computes these arras an a secon iteration oes the necessar remapping, using a support arra of integers name remap, which recors the alrea performe remapping operations. Algorithm The alphaet compression algorithm presente in [], rewritten for a strie- NFA : map[ Σ ][ Σ ] = ; size = : for all q Q o 3: for all q Q o : char[ Σ ][ Σ ] = false 5: class[ Σ Σ ] = false 6: remap[ Σ Σ ] = 7: for all (a, ) Σ Σ (q, (a, ), q ) δ o 8: char[a][] = true 9: class[map[a][]] = true : en for : for all (a, ) Σ Σ o : if!char[a][] & class[map[a][]] then 3: if remap[map[a][]] = then : remap[map[a][]] = + + size 5: en if 6: map[a][] = remap[map[a][]] 7: en if 8: en for 9: en for : en for The asmptotic time complexit of this algorithm can e evaluate as Q Σ. The Σ factor is ue to the iteration through the Σ elements of the alphaet, while the Q factor comes from the repetition of these iterations for ever pair of states. Σ represents the most critical factor, as Σ rapil increases with the strie level while Q is almost constant. Also memor consumption is a concern, as the size of the ata structures use the algorithm are proportional to Σ. Due to these limiting factors, reaching high strie levels with ig rule sets using the current state-of-the-art algorithms results unfeasile. D. infant infant [] is an efficient NFA-ase regex matching processor that runs on GPUs. Its algorithm consists of iterativel

6 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 5 Input: 'a' Input: '' Input: 'c' Input: 'z' Transitions: Transitions: Transitions: Transitions:... -> -> ->... -> 3 -> > 3 -> 3 Input te: 'a' Active states itmap infant threa -> 3 3 -> 3 threa (a) Uncompresse NFA a, c, q q q3 3 -> 3 -> 3 -> 3 threa 3 infant Fig. 3. Overview of the infant processing structure. Active states itmap (e) Translation maps at first strie iteration a c a c c,[ z] reaing input smols an upating the set of active states, () Compresse NFA represente a it vector where each it correspons to q q q3 a state of the NFA. After reaing a new input smol, the algorithm looks for the transitions that can e triggere that (c) Uncompresse NFA, (f) Translation maps at secon strie iteration smol. This operation is mae simple storing transitions q q q3 alrea groupe input smol. These transitions are then 'a' 'e' 'f' 'z' 'a' 'f' coul e replace a single transition laele with the range q 'a' 'a' 'a' 97. When the smols are not completel contiguous, q->q more a f,a franges 'e' are use. 'e' 'f' 'f' 'f' The range notation is exploite in the strie ouling q q->q algorithm q->q treating ranges q->q as atomic entities uring the e z,e z creation of compoun transitions: rather than iterating on smols, 'z' q the algorithm just iterates on ranges an it creates 'z' I II x x I II the range Input # 5, generates a a a comine set of transitions I II I II Input # z, use to upate a it vector that keeps the active states. As () Compresse NFA shown in Figure 3, several transitions are processe in parallel (a) Regexp () NFA (c) -Strie NFA q q q3 a,c, 55 assigning them to ifferent threas, which contriutes to a*c a c a, q q q q q q q3 z 55,a,c alleviate the epenenc of the upate time on the numer c, q , 55, of transitions triggere the input smol. This is mae possile exploiting the high numer of threas that can run on a GPU an the high memor anwith that can e otaine aopting a clever memor access polic. Threa ivergence is almost non-existent, as ( esign) all threas perform exactl the same operations, although on ifferent ata (i.e., ifferent transitions). However, some useless processing ma occur: some threas analze transitions that are not active in the current starting state an ma ecome ile. In aition to this form of parallelism, infant can also process several input strings in parallel, assigning each string to a ifferent group of threas. The scheuler can then exploit the large numer of threas to hie memor latenc, scheuling new groups of threas when the current ones are waiting for ata from main memor (memor access time is far eon the processor ccle time), hence keeping the processor us almost all the time. Thanks to the two forms of parallelism, infant has a reasonal stale throughput, which epens on the average numer of transitions per smol. When the numer of transitions triggere a smol excees the maximum numer of threas supporte the GPU, the algorithm has to iterate transition processing, hence ecreasing throughput. IV. MULTI-STRIDING WITH RANGE NOTATION The asmptotic time complexit of the state of the art strie-ouling algorithm epens mostl on the numer of compoun transitions to e generate, which tens to increase rapil at each strie step. Hence, the time require for strie ouling ecomes quite large even after the first ouling. In orer to reuce the impact of this prolem, we change the wa transitions are represente: transitions are groupe accoring to their source an estination states an the laels of the transitions of each group are store using ranges instea of iniviual smols. For example, if we consier a NFA with the transitions that link q to q laele all the smols etween 97 an, these transitions compoun transitions comining ranges together. a f,a ffor example, a transition set laele the ranges 97 an, when comine with another set laele onl e z,e z q q q laele with just two pairs of ranges: 97,5 an,5. Moreover, the generation of these t t t3 t t9 t t t t7 t8 t9 t t5 t6 t7 t8 t5 t6 t7 t8 t3 t t5 t6 t t t3 t t9 t3 t3 t3 compoun transitions requires onl two iterations of the algorithm, compare Group Group Threa Block to (3 threas) the huge numer of iterations necessar if working with iniviual smols. With range notation, the asmptotic time complexit of strie ouling changes from Q (N fo L s ) to Q (N fo R), i.e. the L s factor, which counts the average numer of transitions that link two states, is replace R, which is the average numer of ranges that are necessar to represent these transitions. In the worst case, if a set of transitions oes not inclue an contiguous smols, a numer of ranges equal to the numer of smols laeling the transition is generate. However, as regular expressions usuall hanle human-reaale alphanumeric characters an the ASCII coe represents them with contiguous values, the vast majorit of NFA are expecte to enefit from the range notation. Moreover, the avantages of this notation ten to e more visile when the NFA is more complex. In fact, complex NFA have more states an more transitions, hence higher proailit to have transitions with contiguous smols. Finall, even with unluckil incompressile rule sets, it is still possile to greatl exploit the range notation starting from the secon strie ouling iteration: as it will e shown in Section V, our alphaet compression algorithm generates equivalence classes in a wa that increases (if possile) the amount of contiguous laels in each transition set. V. IMPROVING ALPHABET COMPRESSION The state-of-the-art algorithm for alphaet compression [] is appropriate with small to meium size NFA ut still requires prohiitive resources with multiple-strie NFA resulting from rule sets of realistic sizes. Our new algorithm (shown in Algorithm 3) is more efficient in terms of oth memor an processing requirements which allows it to hanle larger automata. Furthermore, the algorithm also improves the effectiveness of the range notation. From the memor stanpoint, the algorithm requires approximatel four times less memor than the one in [], as it nees onl an integer per smol pair (i.e., onl the transition map itself) compare For the sake of precision, with iniviual smols, the Cartesian prouct etween smols woul have generate ( ) ( 9) = 377 pairs of smols. q

7 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 6 q q q a f,a f e z,e z a f a f a e f z a e f z Fig.. Example of the improve alphaet compression algorithm. q q q a f,a f e z,e z to the two integers an two ooleans require [] 3. From the processing stanpoint, our algorithm can potentiall run at twice the spee of [] as it performs a single iteration through the main ata structure instea of two. Algorithm 3 Improve alphaet compression algorithm. : map[ Σ ][ Σ ] = : size = 3: for all q Q o : for all q Q o 5: uffer = {} 6: for all (a, ) Σ Σ (q, (a, ), q ) δ o 7: if map[a][] / kes(uffer) then 8: uffer = uffer {map[a][], + + size} 9: en if : map[a][] = uffer[map[a][]] : en for : en for 3: en for The main iea of the new algorithm is to use a ifferent criterion for the cluster ivision step (propose in []): all the smol pairs that can lea from q to q are remappe onto new equivalence classes, taking care of using the same new class for smol pairs that previousl elonge to the same class. For example, if pairs ( aa ), ( ), an ( cc ) can lea from q to q an pairs ( aa ) an ( ) were previousl mappe to the equivalence class while ( cc ) was mappe to, then ( aa ) an ( ) will e mappe to the same new smol (e.g. ) an ( cc ) to a new other smol (e.g. 3). In this wa, correctness is still guarantee, i.e. if two smol pairs lea to the same estination states from each state, the are mappe to the same class. The uffer hash-map is use to store the re-mapping operations performe at each cluster ivision step (it maps classes onto new classes). Initiall there are no re-mapping operations (line 5). For each smol pair that can lea from q to q, if it was previousl mappe to a class for which a re-mapping oes not alrea exist in the uffer (line 7) a new class is create an a new re-mapping from the previous class to the new class is ae to the uffer (line 8). Vice versa, if a re-mapping alrea exists in the uffer, this is simpl use to remap the smol pair. Figure shows how the algorithm works for the same FSA presente in Figure. The left-han sie of the figure shows the translation map resulting after processing the transitions The vast majorit of the compilers use an integer to store a oolean for performance reasons. 3 For the sake of precision, our algorithm uses also an hashe map whose numer of entries is equal to the average numer of equivalence classes the transitions connecting two states elong to; however the size of this ata structure can e consiere negligile compare to the main ata structure. from state q to state q. All the smol pairs that can lea from q to q, corresponing to the square area with coorinates from a,a to f,f, have een re-mappe to the same new class. In the secon step, the transitions from state q to q are examine. The are laele smol pairs corresponing to the square area with coorinates from e,e to z,z. This area can e ivie into two parts: the former that overlaps with the area previousl fille with class an the latter that oes not overlap with it an is still fille with the original class. These two parts are re-fille with two ifferent new classes, namel an 3, thus leaing to the situation shown in the rightmost part of the figure. Hence, as expecte the algorithm has generate four ifferent classes: class, corresponing to unuse smol pairs, class corresponing to the smol pairs that can trigger onl the q q transitions, class 3 corresponing to the smol pairs that can trigger onl the q q transitions, an class, corresponing to the smols that can trigger oth. The onl ifference in terms of output etween the new algorithm an [] is that we perform a remapping even when it is not strictl necessar, possil leaing to non-contiguous class numers. For example, if the algorithm generates classes it is possile that those are numere,, 3, an, as the class was overwritten in some intermeiate steps of the algorithm. Even if the wa integer values are assigne to classes is irrelevant from a theoretical point of view, the assignment of non-contiguous values ma negativel affect strie ouling, ecause the effectiveness of the range notation ecreases. For this reason, an aitional phase is ae to the algorithm where the generate classes are renumere in orer to guarantee that the are ientifie contiguous integers.this step is performe insie the algorithm that translates the NFA to the new alphaet (which has to e execute anwa after alphaet compression), hence aing a ver small overhea to the overall process. The translation algorithm is ver simple: it iterates through each transition of the NFA an sustitutes the ol smol pair with the corresponing class numer. When oing so, the class is also renumere (if neee). This as a couple of memor accesses at each iteration an it requires keeping a ictionar to rememer which classes have alrea een renumere. Clearl, the aitional cost of class renumering is negligile compare to the rest of the algorithm. The same applies for memor requirements, as the size of the aitional ictionar is equal to the carinalit of the compresse alphaet, which is wa less than the size of the support map. Finall, as an ae value, class renumering chooses new class inexes with an heuristic that increases the amount of contiguous values in each set of transitions that connect two states: this is one in orer to increase the efficienc of the range notation (i.e., to reuce R). As the asmptotic time complexit of the strie ouling algorithm is proportional to R, reucing R spees up the next strie ouling iteration. The heuristic is ase on the following iea. Let us consier the first state pair that is processe. If, for instance, the transitions that link these states are laele the 3 equivalence classes 563, 36 an 3, the algorithm renames class 563 to, class 36 to an class 3 to 3, thus guaranteeing

8 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 7 the contiguousness of their smols. For the next state pairs, onl the classes that occur for the first time are re-mappe to new (consecutive) integers. Hence, the laels of the transitions connecting the state pairs that are processe first will more likel have more contiguous smols, while the ones of the state pairs processe last will proal have more sparse sets. B processing state pairs with more complex transition sets first, the overall numer of ranges is reuce. This can e achieve in practice sorting the state pairs accoring to the complexit of their transition sets efore performing renumering. The time complexit of the new alphaet compression algorithm can e expresse as Q L s. Compare to the complexit of the algorithm in [] an reporte in Section III-C, it can e note that the L s factor replaces Σ. This is an important improvement ecause usuall L s is lower than Σ, as tpicall in an NFA onl few states are connecte to each other with all the possile smols of the alphaet. A. Optimize remapping In our experiments we foun that often ifferent state pairs are connecte transitions with the same smols, i.e. the smols that can lea from q x to q are the same that can lea from q z to q t. In this case, the remapping one the algorithm for the first pair of states woul e completel overwritten when processing the secon pair of states, as the refer exactl to the same area in the map, which means that the processing of the secon pair of states coul e safel omitte. As this case is frequent, this optimization leas to an important reuction of the processing time. The high frequenc of this case in real situations epens on several ifferent causes. One is the alrea mentione alphanumeric nature of most regular expressions. Another one is the wa the NFA is tpicall generate from a set of regular expressions. For instance, the comination of two suexpressions the AND operator results in a uplication of the initial set of transitions. However, most frequentl the presence of state pairs with exactl the same transition sets is cause the.* pattern, inclue in man regular expressions, which causes the generation of state pairs q x, q such that an smol in the alphaet can trigger a transition from q x to q. A state pair with this set of transitions forces the algorithm to upate the entire support map, an, eing these state pairs frequent, the are also responsile for a consierale portion of the time require to process the entire NFA. For this reason, omitting the processing of transition sets that have alrea een processe for previous state pairs represents a simple ut ver effective optimization. In orer to recognize an skip state pairs whose transition sets have alrea een processe, the propose algorithm exploits the aove mentione sorting of state pairs accoring to the complexit of their transition sets. However, in orer to guarantee that state pairs with ientical transition sets are ajacent, this sorting has to e refine introucing a seconar sorting criterion. The criterion use is not relevant, provie that a preceence relationship is efine for an two ifferent transition sets. A simple wa is to represent transition sets as strings of alphaeticall orere lists of state pair ranges (e.g. {(a f,a f),(a f,i k)} ), an to use the alphaetic orer of transition sets as the seconar sorting criterion. With state pairs sorte in this wa, the algorithm compares the transition set of the n-th state pair to the one of the (n-)-th state pair an, if the are foun to e the same, it skips the remapping step. Using a O(n log(n)) sorting algorithm, the asmptotic time complexit of the sorting phase is Q R log( Q R), which is negligile compare to the asmptotic time complexit of the algorithm that generates the equivalence classes. This stems from the fact that L s, which is the most critical factor in the formula, cannot e smaller than R. VI. MULTI-MAP ALPHABET COMPRESSION Although alphaet compression is usuall ver efficient in terms of smol compression ratio, it is not alwas sufficient to enale further strie ouling steps with the current harware. The inailit to further reuce the alphaet size mainl epens on the necessit to create aitional equivalence classes when transition sets overlap: for instance, in Figure three equivalence classes are neee to represent the smol pairs of two transition sets ue to the overlap of their laels As it woul e ifficult to further increase the compression ratio of the current alphaet compression technique, another wa to enale further strie ouling is to increase the egree of parallelism of the packet processing prolem. Let us enote N the numer of input smols to e processe an T the time taken to process them with a single processor. The iea is that if we are unale to process one string of N/ smols in a time T/ with one processor ecause strie ouling is unfeasile, we ma manage to process two ifferent strings of length N/ in T/ using two processors in parallel. In orer to split the matching prolem into two separate prolems, the transition sets of the NFA are partitione into two isjoint groups in such a wa that the amount of overlapping transition sets is minimize in each group. Then, the alphaet compression algorithm is applie separatel on each group of transitions, generating two translation ictionaries, i.e. two target alphaets an two translation maps, instea of a single one. This operation is likel to e feasile even if alphaet compression is unfeasile on the whole transition sets. The two ictionaries are then use to uil a NFA in which the two ifferent alphaets coexist: some transitions are laele the smols of the first ictionar, while the others are laele the smols of the secon one. If the transition sets are partitione into the two groups minimizing the numer of overlaps, each group can e compresse more efficientl than the original set, hence reucing the overall numer of generate smols. This more efficient compression originates a simpler NFA (i.e. with fewer transitions) than the one generate using the normal alphaet compression technique. As fining the optimum partitioning is NP-complete, a heuristic is use. Because the new NFA contains smols elonging to two ifferent alphaets at the same time, oth the input translation an the matching algorithm change: input translation generates

9 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 8 two strings, one for each ictionar, an the two resulting strings are then processe in parallel: a first process consiers onl smols elonging to the first alphaet while the secon one consiers onl smols elonging to the secon alphaet, ut oth processes upate the same set of active states. In this wa, the set is upate correctl, since in each state each possile transition of the original NFA is either applie the first or the secon process. From a practical point of view, using more processing units to perform the matching is not a prolem as the target harware is a GPU with hunres of availale processing elements. Moreover, each processing unit works on a NFA with, in average, half of the transitions per smol, thus hopefull halving the per-smol processing cost of each unit. At the same time, having simpler NFA makes it more likel to e ale to complete an aitional strie ouling step. The moifie compression algorithm (Algorithm ) is similar to the one presente in Section V, ut uses two maps, map an map. For each state pair, the algorithm uses the map on which the upate causes the generation of the smallest numer of new equivalence classes. This is achieve first upating oth maps (ApplSmols correspons to lines 5- of Algorithm 3, with the aition of the recoring of the ol state of the moifie cells of the map) an then keeping onl the one that has reache the smallest size (Uno restores the map to its previous state). Algorithm The multimap alphaet compression algorithm. : map[ Σ ][ Σ ] = ; map[ Σ ][ Σ ] = : size = ; size = 3: for all q Q o : for all q Q o 5: ApplSmols(map, size, q, q) 6: ApplSmols(map, size, q, q) 7: if size < size then 8: Uno(map, size, q, q)} 9: else : Uno(map, size, q, q) : en if : en for 3: en for The gree algorithm use for the selection of the map has een chosen after having consiere ifferent possile alternatives: GRASP (Gree Ranomize Aaptive Search Proceure), which alternates gree optimizations to ranomizations in orer to overcome local minimums; Top-Down, similar to the algorithm propose in [9], which creates a ifferent map per each transition an then it works merging the maps an minimizing the amount of aitional generate smols at each pass; Top-Down, which as a ranomization pass to the previous technique to avoi local minimums. The consiere algorithms have een evaluate in terms of oth time an compression efficienc, appling them to the NFA otaine using the same real rule sets use for our experiments (see section VIII for etails). As the gree algorithm is the simplest one, it is also the most time efficient. For what concerns compression efficienc, the results TABLE II ALPHABET SIZE WITH DIFFERENT MULTIMAP HEURISTICS. Rule set No compress. Gree GRASP Top-own Top-own FTP SMTP HTTP ET HTTP Full in Tale II show that no heuristic outperforms the others in all cases. The gree algorithm represents the est choice ecause it comines est time efficienc with goo compression results on average. The multi-map algorithm requires approximatel twice the time an memor compare to the original algorithm, ecause of the necessit to upate two maps instea of a single one. However, thanks to the etter compression provie, susequent strie ouling steps will e performe faster an with less memor compare to the original technique. A. Run-time packet processing Even if, in theor, multi-map alphaet compression shoul e ale to greatl improve the maximum achievale strie level an the processing throughput, appling further strie ouling an alphaet compression iterations to an NFA that has een compresse with the multi-map algorithm causes aitional issues that must e properl hanle, an that can e explaine taking into account the example in Figure 5. When a - map alphaet compression is performe on the uncompresse NFA of Figure 5a, the transition from q to q an those from q to q are assigne to the first support map, while the transition from q to q 3 is assigne to the secon support map. The support maps generate in this wa are shown in Figure 5e, while the compresse NFA is shown in Figure 5. If the input string is a,,c,, the following two strings are generate appling the two translation ictionaries:, ( using the first map), an, ( using the secon map). The packet processor has to concurrentl process smols an at the first iteration, an smols an at the next one. At the first iteration, if the active state set inclues onl state q, the transition from q to q is triggere smol, thus aing q to the new active state set. Then, at the next iteration, the active state set inclues onl q an oth the transitions from q to q 3 an from q to q are triggere smols an respectivel. Thus, the final active state set inclues exactl states q 3 an q, as expecte. If a new strie ouling step is performe, e.g., to get the NFA of Figure 5c, the presence of two translation maps complicates the input string translation an the matching algorithm. In fact, let us assume we appl exactl the same algorithms explaine aove. The input string a,,c, is translate into the two strings, an,, using the two maps. Then, the matching algorithm consumes the pairs of smols, an, concurrentl. If initiall the onl active state is q, the pair, triggers the transition from q to q, while the pair, oes not trigger an transition. Thus, the final active state set inclues onl state q, which

10 Smols (#) Active states itmap infant threa threa infant IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 9 is wrong, the reason eing ecause the transition, has not een fire while it shoul have een. The particularit of the smol pair, is that its two components elong to ifferent translation maps while our translation algorithm onl generates two translations: one mae onl of smols of the first alphaet, another mae onl of smols of the secon alphaet. In orer to make sure that we get the right result of matching at strie level, it is then necessar to generate more translations of the input string, incluing all the possile cominations of the two translation maps. In our example this implies generating translations, i.e. the strings,,,,, an,, where the first string is otaine using the first map for oth smols in each pair, the secon one is otaine using the first map for the first smol an the secon map for the secon smol in each pair, an so on. Appling again the -map alphaet compression algorithm on the NFA, we get two new translation maps, shown in Figure 5f, an the new compresse NFA shown in Figure 5. Having introuce two new alphaet translations, the input strings we ha for the uncompresse NFA have to e translate into 8 strings: otaine appling the first new translation map an otaine appling the secon one. Taking again the aove example, the string, is translate the leftmost map of Figure 5f into while the remaining strings are translate into. Using the rightmost map, the same four strings are translate into z (generate from string,), an three equal smols,,. In general, our approach increases the compression rate ut the numer of strings to e processe in parallel is multiplie the numer of maps (with maps it is oule) at each alphaet compression step. At each strie ouling step, instea, the numer of strings to e processe in parallel is square, an the length of the input strings is halve. Then, after n comine strie ouling an alphaet compression steps, the numer of strings that we must process in parallel ecomes S n = m n where m is the numer of maps use for alphaet compression, an the length of the input strings ecomes N/n, where N is the original length of the input string. However, increasing the numer of input strings to e processe in parallel is not necessaril an issue: it just makes the prolem more parallel. Furthermore, even if some threas are waste, the GPU coe can minimize this overhea greatl reucing the total amount of memor accesses, which is the actual main ottleneck of infant. In fact, we can note that multiple occurrences of the same smol in ifferent strings at the same offset are reunant ecause the lea to the same target states. Detecting reunant smols an avoiing their processing is a wa to reuce memor accesses. Reunancies happen frequentl an can e oserve even in our simple example: of the 8 -smol strings generate from the translation of the original -smol input, 6 strings are ientical ( ). Hence, five of them coul e ignore. B. Reunanc in input string translations Experimentall, it is eas to show that the phenomenon of smol reunanc occurs with an rule set, an that, after a certain strie level, reunant smols ten to increase. The (a) Uncompresse NFA a, c, q q q3 () Compresse NFA, c,[ z], z q q q3 (c) Uncompresse NFA q q q3 () Compresse NFA q q q3 (e) Translation maps at first strie iteration a c a c (f) Translation maps at secon strie iteration z Fig. 5. Multi-map compression of a sample NFA at x an x strie levels FTP SMTP x x x 8x 6x Fig. 6. Amount of non-repeate smols present in ranom input ata streams when translate to e processe with -map multistrie NFA experiment consists of generating the NFA at ifferent strie levels for the consiere rule set, using the strie ouling an the -map alphaet compression algorithms propose here. Then, various pseuo-ranom input packets are generate, an, for each strie level the are translate using the ictionaries originate the alphaet compression, as previousl explaine. Then, the input strings so generate for each strie level are reuce removing reunant smols (i.e., smols that are equal to other smols occurring in other strings at the same position). For example, if we have the two strings,,3,, an,,3,3, we can remove the smols in ol as the are the repetition of smols occurring at the same position in the other string, hence leaving onl the meaningful smols. Finall, the total length of all the reuce strings, which represents the total numer of non-reunant smols, is compute. Figure 6 plots the results of this experiment using 5 tes packets an the rule sets use the popular NIDS to analze FTP an SMTP packets (each plot refers to a single rule set). These rule sets have een chosen ecause of the possiilit to reach high strie levels thanks to their reuce size, consisting in aout simple regular expressions (more etails on the rule sets will e presente in Section VIII). In asence of smol uplications there woul e no length change when moving from x to x (ecause strings of length N/ each are generate), ut there woul e a length increase of two times when moving from x to x (ecause 8 strings of length N/ each are generate), an increase of 6 times when moving from x to 8x (ecause 8 strings of length N/8 each are generate) an so on. In Figure 6, instea, we can see that from x to x the ata increase is I x I t t5 (a) R a

11 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 (a) Uncompresse NFA (e) Translation maps at first strie iteration a, c, q q q3 a c a c wa less than the ouling that woul occur in the asence c,[ z] of smol uplications, while at further strie levels there is even() acompresse linear reuction NFA of the total numer of smols rather thanqthe exponential q growth q3 that woul occur in the asence of smol uplications. Intuitivel, (c) Uncompresse as the strie NFA level increases, (f) Translation overlapping maps transition sets that nee to e separate inat orer secon tostrie avoi iteration generating, extraq equivalence q classesq3 are less likel to occur. This happens ecause at each strie level transition sets are z compresse, into fewer single classes an, at the same time, the size of () Compresse NFA the support maps increases. For this reason, support maps q q q3 ecome increasingl less populate (particularl the seconar ones), an the amount of input string patterns with several z ifferent possile translations ramaticall ecreases at each strie level. This tren can also e unerstoo consiering the asmptotic ehavior: in the hpothesis that we can compute the k -strie NFA for aritraril large values of k using -map compression, we woul reach a limit situation where the input string is translate into strings of length. In this extreme case, if N is the numer of accepting states (which tpicall correspons to the numer of regular expressions in the rule set), we woul necessaril en up with N + meaningful smols, each one leaing from the initial state to one of the accepting states, plus an extra smol representing all the other cases. Hence, asmptoticall, the numer of meaningful smols tens to N +, where N is the numer of accepting states of the NFA. Consequentl, as the rule sets use in the experiments presente in Figure 6 are compose of aout regular expressions each, it can e expecte that the numer of meaningful smols to e processe will ten to aout, as well as the alphaet size. This explains the reuction that we can alrea oserve starting from the 8-strie NFA (where 8 strings are generate when translating each input string). The ehavior in the transient phase, represente the first strie levels, as well as the maximum numer of useful smols, epens on the complexit of the rule set taken in consieration. However, in our test cases, all the rule sets for which it was possile to go eon the x arrier have generate a huge amount of reunant, useless smols. Base on these consierations, infant has een optimize to take avantage from this phenomenon an avoi to process useless smols, with the consequence that it can achieve high performance at an strie level, as shown in the next section. Finall, it is worth noting that the maximum numer of useful smols to e processe oes not epen on the input string: there is, in fact, an upper oun that epens onl on the rule sets, thus making it impossile to forge a malicious incompressile input ata string. VII. INFANT IMPROVEMENTS After presenting the algorithms that improve the generation of multi-strie NFA, which are inepenent from the actual regex processor, this section presents the aaptations of in- FAnt to the aove mentione algorithms. The moifications were esigne to maintain the excellent properties of the original implementation in terms of (limite) coe ivergence an coalesce memor access patterns. I x I q x II II t t t3 t t5 t6 t7 t8 'z' Input # t9 t t t t3 t t5 t6 I a I t7 t8 t9 t t t t3 t Group Group Threa Block (3 threas) a II II Input # t5 t6 t7 t8 t9 t3 t3 t3 Fig. 7. Threa task scheuling uring the processing of input messages translate in ifferent strings each (a) Regexp () NFA (c) -Strie NFA a,c, 55 a*c a c a, q q q q q 55,a q,c c, q3 55 q3 55, 55, nviia GPUs are compose of multiple processing units, each one ale to execute one or more threa locks in parallel, with 3 to threas per lock. The original version of infant followe this architecture closel, assigning a ifferent network packet to each lock an using one threa per transition in the computation of the next active states. However, when operating at high strie levels this choice ecomes no longer aequate ecause the average numer of transitions per smol tens to ecrease, often ecoming much lower than 3. In this case, the architectural constraint of having a minimum of 3 threas per lock implies that some threas will remain ile, thus wasting resources. For this reason, our new threa hierarch efines a threa group that oes no longer coincie with the harware threa lock. Particularl, each threa lock is logicall partitione into several threa groups, each one processing a ifferent input string, which comes from either ifferent network packets, ifferent maps, or oth. This re-organization of threas not onl enales a reuction of the numer of inactive threas (achieve also in [6] an [7]), ut also helps to etter manage the parallel processing of the several string translations that originate from multi-map compression. In the new algorithm, the threas of each group cooperativel process all the translations of a single input string. Figure 7 shows an example where a 3-wie threa lock has een logicall partitione to process two packets, each one translate into two ifferent input strings ecause of the NFA with -map compression, hence resulting in istinct threa groups (8 threas wie) running in parallel. In that picture, each te from each input packet (i.e., packets Input # an Input #) is in fact translate into two ifferent smols (e.g., x an x for the first packet, a an a for the secon), which represent the actual input for a ifferent threa group. In this wa we maintain the parallel processing of the transitions associate to each input smol (in the example, each smol is processe eight threas), as in the original infant version, ut reucing the numer of useless threas. Thanks to the nviia GPU harware architecture, processing ifferent strings in the same threa lock also enales a reuction of memor accesses. In fact, as iscusse in Section VI, the proailit that two or more threas have to process the same smol at the same time in the same threa lock is ver high, ue to the smol reunanc phenomenon iscusse in Section VI-B. When this happens, the memor management unit of the GPU can recognize the presence of memor requests for the same location an join them together, thus reucing the real amount of accesses to the main memor. q a

12 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 TABLE III THE RULE SETS USED FOR TESTING. Tpe Name Rules Size States TpS Thr (#) (KB) (#) (#) (Mpps) Small FTP SMTP Meium HTTP Large ET HTTP Full Thanks to this feature, as far as we manage to process all the translations of an input string threas elonging to the same lock, the amount of actual memor accesses correspons to the numers plotte in Figure 6. This means that even if the numer of smols that must e processe increases at each strie ouling, the coe can actuall get ri of this (theoretical) aitional loa. However, this is possile as far as the numer of translations is small enough to fit into a single lock. In orer to make this possile in a roaer range of cases, an aitional optimization has een ae to the coe that performs input translation: the ata translator compresses its output completel ropping the strings that contain onl reunant smols. This trivial compression mechanism ramaticall reuces the numer of translations that must e processe for each input string. This was confirme our experiments: even at the highest strie levels an with the most complex rule sets, we never ha to process more than 6 translations per input string. This reuction of the numer of translations also reuces the total numer of threas require, which contriutes to increase the scalailit of the solution, ecause the maximum numer of threas is an harware constraint. Moreover, if the numer of strings to process increases too much it is even possile that the ata transfer etween the CPU an the GPU ecomes the main ottleneck of the sstem, thus limiting the maximum achievale throughput. VIII. EXPERIMENTAL EVALUATION The experimental evaluation of our algorithms has een carrie out on an Intel i7-96 qua core workstation running at 3.Ghz, GB RAM with an nviia Tesla c5 GPU. Tale III lists all the rule sets use in our tests along with their size, average numer of transitions per smol (TpS), an average throughput as otaine vanilla infant, without multi-striing. The throughput has een measure taking into account onl the GPU kernel execution time; more etails aout the overall processing costs will e shown in Section VIII-D. Rule sets have een classifie as small, meium or large, epening on the complexit of the generate NFA; some have een extracte from the an EmergingThreats (ET) commercial ruleset ataases. Full inclues the full rule set, with the onl exclusion of the rules that have no stanar PERL sntax. 53 is a selection of 53 rules use as enchmark in [3]. The other an ET rule sets have een uilt selecting onl the rules of a single protocol, specifie in the name of each rule set. A. Multi-strie NFA uiling process This section compares the performance of our algorithms for uiling multi-strie NFA (name new in the single-map form an -new in the -map form) with the state-of-theart algorithms Becchi an Crowle [], name TACO3, implemente ase on the pseuo-coe presente in []. Other algorithms presente in Section II, such as [], [8], [], are so inefficient that often the cannot even complete our tests an hence the have not een consiere in our analsis. In orer to evaluate the contriution of the state pair sorting an the consequent elimination of the reunant transitions (Section V-A), we present also the results otaine with a version of our single-map algorithms (name unopt ) that oes not inclue this optimization. The performance of uiling the multi-strie NFA has een evaluate measuring the total time an the maximum amount of memor taken each algorithm on our rule sets. The time taken for - an NFA have een measure separatel. Measurements have een repeate in orer to otain statistical significance. Results, plotte in Figure 8, have een normalize taking the performance of the TACO3 algorithm (execute with a preliminar compression of the initial alphaet, as propose in []) as reference. Vertical lines represent the min an max otaine values while rectangles represent the average value plus an minus stanar eviation. Figure 8 shows that the various algorithms ehave nearl the same when consiering the level, particularl with respect to the memor requirements (Figure 8). The reason is that the size of the most critical memor structure (the support map of size Σ ) is negligile compare to the memor require the software to loa (e.g., run-time liraries). The situation changes when moving to strie level (Figures 8c,), where unopt is, on average, times faster than TACO3. Aing state pair sorting an reunant string elimination further reuces the time another orer of magnitue, while the -map compression is more than three orers of magnitue faster than TACO3. Memor consumption improves consieral as well: the new algorithm halves the requirements compare to TACO3; the theoretical x improvement (our algorithm uses one fourth of the memor use TACO3 for each item in the map) cannot e reache ecause of the aitional ata structures neee the algorithms, whose occupanc is not negligile in the case of simple NFA. Instea, multi-map alphaet compression greatl reuces the alphaet size, thus ecreasing the size of the support structures as well. Figure 9a shows the results of another experiment that has een performe to evaluate the maximum strie level that can e otaine in a -hours time span. For a given algorithm an rule set, the experiment consists of appling the algorithm to iterativel oule the strie level of the NFA until the hours limit is reache. The maximum otaine strie level is shown in the graph. In four cases out of six new outperforms TACO3, while -new oes even etter ecause it iterates at least one time more than TACO3 with all the rule sets, an even more times in the simplest cases. One of the motivations for the run-time processing improvement when using multi-map alphaet compression is the

13 8-strie 8-strie 8-strie 8-strie Compression efficienc Throughput (Mps) Compression efficienc Throughput (Mps) Sm IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 TACO3 unopt new -new x x x TACO3 unopt new -new 8x 6x. TACO3 unopt new -new TACO3 unopt new -new (a) NFA (time) () NFA -map (memor) (translation).75 -Map (Meium) -map (processing) -Map (Big) Map (Small).3 -map (translation) Map (Meium). -map (processing) Map (Big).... TACO3 unopt new -new TACO3 unopt new -new TACO3 unopt new -new. TACO3 unopt new -new n/a n/a ρ x x 8x Fig. 8. Comparison of total require time an maximum require memor when generating 5 an NFA for all rule sets -Map (Small) -map (translation) (a) Max reachale multi-strie level ρ () NFA -Map (memor) (Small) 9 x 8 x 7 x 6 x 5 x x 3 x x x x -Map (Meium) () -Map Max (Big) multi-strie throughput oost -Map (Small) 3 -Map (Meium) -Map (Big) ET HTTP (a) Max reachale multi-strie level () Max multi-strie throughput ET HTTP oost ET (c) Average transitions per smol FTP SMTP HTTP 53 5 FTP SMTP HTTP 53 5 FTP SMTP HTTP 53 HTTP 5 8 TACO3 New -New 9 x New -New x x x 8x 6 8 x 7 x Fig. 9. Per-ruleset an per-algorithm performance comparison: 6 x (a) maximum strie level reachale in a -hour timespan, () maximum throughput oost of infant an (c) average amount of transitions per smol 5 x 8 x 6 3 x x x expecte n/a n/a n/a reuction of the average amount of x transitions per of Figure 9. B comparing them with the ones reporte in smol when the strie level ET HTTP increases. This value greatl influences FTP the SMTP average HTTP processing 53 cost 5 per smol FTPin infant TACO3 New -New an, as escrie in Section VI, it is expecte to e at least halve at ever iteration of the multi-striing algorithm. This reuction has een experimentall oserve on the selecte rule sets, as shown in Figure 9c, which reports this numer at ifferent strie levels for various rule sets. This reuction is alwas present at all strie levels an with an rule set; even, this effect seems to e amplifie with ver large rule sets. For example, 5 achieves a reuction rate >8%. B. Run-time traffic processing SMTP HTTP New This section analzes the run-time performance in regex matching that can e achieve using the new algorithms. The throughput of the new infant was measure for each rule set with the NFA otaine the new an -new algorithms, using a set of real traffic captures taken from our Universit network. TACO3 has een omitte as its NFA coincies with new. Each result has een normalize with respect to the throughput otaine using infant with the -strie NFA for the same rule set, which was reporte in the last column of Tale III; the resulting speeups are shown in Figure 9. The tests were repeate also with two tpes of snthetic traffic, the first one mae of packets fille with ranoml generate paloas, an the secon one uilt as a worst case, maximizing the numer of active states generate uring processing. The measure throughput was rather stale, resulting (in the worst case) in a % worsening compare to the real traffic. For the sake of revit, we reporte in Figure 9 onl the results otaine with real traces. In this experiment, the NFA with the largest strie level that coul e compute an loae in the GPU has een use for each ruleset. These strie levels are inicate in the columns 5 (c) NFA (time) -map (processing) (c) Average transitions per smol -map (translation) -map (processing) n/a n/a x x 8x n/a n/a n/a Figure 9a, ET HTTP it is possile to notice that often the NFA ET with the maximum 53 strie 5 level that canftp e compute SMTP HTTP in53 a HTTP hours 5 time -New x x x 8x span is too large to e loae in the GPU. As alrea note in [], a x throughput oost can e expecte at each strie ouling. Figure 9 sustantiall confirms the aove finings, showing that there are cases in which the oost is even higher an cases for which it is slightl lower. The results also show that, although with - new the throughput oost at each strie ouling is usuall slightl less than x, the overall speeup is almost alwas greater than that achievale with -new. The onl exceptions are HTTP an 5, for which the same oost is achieve with -new an -new, which is ue to the memor limitations of the GPU which prevent the largest NFA from eing loae. However, this prolem shoul isappear with future generations harware, as GPU venors are constantl increasing the amount of memor on their vieo oars. Divergence, as measure the nviia profiler, is usuall elow 5% an never excees %. C. Multi-map alphaet compression efficienc As the NFA generate the multi-map algorithm are completel ifferent from the ones otaine with the other techniques, aitional tests have een performe in orer to show that the enefits of multi-map o not epen on the rule sets. This is important in orer to guarantee that the efficienc of alphaet compression cannot e compromise malicious rule sets, carefull create in orer to ecrease the throughput of the pattern matching algorithm. These aitional tests exploit ranoml generate rule sets, otaine using a generator that extens the algorithm escrie in [8] with the possiilit to generate neste regular expressions such as a(c. )+, thus ieling more realistic

14 8-strie 8-strie Compression efficienc Throughput (Mps) Throughput (Gps) Latenc (s) IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 TACO3 unopt new -new TACO3 unopt new -new. TACO3 unopt new -new TACO3 unopt new -new ρ (a) Max reachale multi-strie level 9 x 8 x 7 x 6 x 5 x x 3 x x x x -Map (Small) -Map (Meium) -Map (Big) -Map (Small) -Map (Meium) -Map (Big) Fig.. Compression efficienc of -map an -map alphaet compression rule sets. Moreover, this generator inclues a parameter (ρ) that controls the percentage of wilcars that occur in the generate expressions. With ρ =, pure plain sequences of characters (or sets of characters) are generate, while with ρ =.5 each generate character (or set of characters) is associate to one of the possile wilcars, like repetition operators such as +,, {a, } or the optionalit operator ET HTTP?. Three FTP ifferent SMTP HTTP classes 53 of rule 5 sets have eenftp generate: TACO3 New -New small rule sets, mae regular expressions of characters each, meium rule sets, mae 5 regexps of 5 characters each, an large rule sets, mae regexps of characters each. For each class, rule sets with ifferent values of ρ have een generate in orer to stu the compression efficienc when varing this parameter. For each generate rule set, the NFA has een uilt using oth new an -new. Then, for each NFA a compression efficienc ratio has een calculate as C eff = ( Σ Σ )/ Σ, where Σ an Σ are the alphaet sizes of the NFA measure efore an after compression ( means that the alphaet size has een reuce to a negligile value, while means that the compressor was unale to reuce the alphaet size). For each value ρ, rule sets have een generate for each class an the average compression efficienc has een plotte in Figure. The graph shows that the efficienc of the -map algorithms (ashe lines) is alwas etter than that of the -map algorithms (continuous lines). Moreover, it clearl shows that, even if with meium or large rule sets the efficienc rops to with complex rules (i.e., when ρ increases), multi-map compression can support more complex rules than its single-map counterpart. D. Sstem-wie run-time processing The last tests aim at assessing run-time performance from a sstem-wie perspective, hence consiering that packets coming from the network are (i) uffere, (ii) translate accoring to the multiple maps in use, (iii) transferre to the GPU through the PCI Express us, an finall (iv) processe the NFA algorithm. Figure (a) shows the sstem throughput (ars) an the maximum latenc experience each packet (line) with atches of ifferent sizes. Experiments use the real traffic traces an the 53 ataset, with maps. The use network is a Gigait Ethernet. Although the throughput is maximum with the largest atch, it ecreases slowl with smaller atches, while at the same time latenc improves consieral. For instance, a atch of packets achieves 9% of the throughput of the largest atch ( () Max multi-strie throughput oost Throughtput vs. latenc,,3,,5 n/a n/a Receiving ata Input translation (for multimap) 8 x5 x 8x Transfer (RAM --> GPU) Numer of packets in a single atch GPU processing (a) Fig.. (a): Throughput (ars) an latenc (line) with ifferent sizes of the processing atch; (): latenc reakown in case of a atch of 5 packets. n/a n/a n/a can achieve high throughput while keeping latenc small. SMTP HTTP 53 ET HTTP 5 FTP SMTP HTTP 53 ET HTTP 5 New -New x x x 8x,8,6 Latenc reakown -map (translation) (sec),675 -map (processing) -map (translation),675 -map (processing) (c) Average transitions () per smol packets), ut latenc rops from 96 to 9 ms. This confirms that our sstem oes not trae throughput for latenc an it Figure () explores the contriution of the aove steps to the overall latenc. While, efinition, uffering an GPU processing take the same time ecause we fee our sstem with the maximum amount of ata that can e processe, the CPU-to-GPU transferring time is negligile in all conitions. Instea, translation time, which increases with higher strie levels, can contriute significantl to the overall latenc an it can also potentiall limit the throughput when its uration excees the GPU processing time (as in some of our 8x tests), as the usual parallelization strateg (i.e., the CPU translates the n-th packet while the GPU analzes the (n-)-th packet) is no longer enough. However, the smol translation coe use in the experiment coul e greatl improve: features like the ata compression presente in Section VII can reuce the gloal amount of memor accesses, thus increasing the translation throughput; furthermore several CPU cores can e use to translate (i) multiple packets in parallel, (ii) multiple maps in parallel, an (iii) even partition the same input packet on ifferent cores, as the translation process is stateless. Finall, it can e note that the throughput of the translation algorithm epens mainl on the numer of memor lookups, which can e safel assume having a fixe cost ecause ictionaries are so large to prevent the CPU to cache them. Since the amount of memor accesses is onl affecte the numer of use ictionaries an not their size, the cost of the translation epens onl on the strie level. Consequentl, the NFA (hence the rule set) has almost no influence on the translation process. In conclusion, the translation process oes not represent a prolem for the time eing, while several possile improvement strategies are availale for the future. IX. CONCLUSION Multi-striing, a well-known technique to improve the efficienc of regex matching, presents some critical issues in the process of uiling the multi-strie automaton. This paper has presente new algorithms to overcome these limits, focusing on NFA-ase regex matching. First, a more efficient set of algorithms (in terms of processing time an memor) to perform strie ouling an single-map alphaet compression has een propose. Secon, a new algorithm for alphaet compression

allowe multi-map compression in the NFA matching at run-time.

reach higher strie levels. This result, couple with the more efficient run-time matching algorithm on GPUs, iels faster processing of network traffic.

15 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. X, NO. X, X 5 ase on the multi-map technique has een presente, along with a companion GPU-ase matching algorithm that makes it particularl effective ecause of the capailit to etter exploit the parallelism allowe multi-map compression in the NFA matching at run-time. The experimental results show that the new algorithms for uiling multi-strie NFA outperform the existing ones in terms of oth time an memor, hence enaling either to hanle more complex rule sets or to reach higher strie levels. This result, couple with the more efficient run-time matching algorithm on GPUs, iels faster processing of network traffic. Furthermore, tests with ranoml generate an worst case patterns show that the results o not epen on the chosen rule sets nor are sensile to properl-crafte malicious traffic. Two future research irections are envisionale. First, a more efficient memor management in infant, mimicking the technique propose in [9], which enales the processing of larger NFA. This coul roaen the applicailit of our algorithms ecause we are ale to uil ver efficient multistrie automata ut we ma e unale to use them with the current harware as the o not fit in the (limite) memor size of current GPU cars. Secon, the translation process can represent another area of research, although with the current harware it is not et a prolem. [3] G. Vasiliais, M. Polchronakis, S. Antonatos, E. P. Markatos, an S. Ioanniis, Regular expression matching on graphics harware for intrusion etection, in proceeings of RAID 9. Berlin, Heielerg: Springer-Verlag, 9, pp [] L. Wang, S. Chen, Y. Tang, an J. Su, Gregex: Gpu ase high spee regular expression matching engine, in Proceeings of the Fifth International Conference on Innovative Moile an Internet Services in Uiquitous Computing, ser. IMIS. Washington, DC, USA: IEEE Computer Societ,, pp [5] M. A. Jamshe, J. Lee, S. Moon, I. Yun, D. Kim, S. Lee, Y. Yi, an K. Park, Kargus, a highl-scalale software-ase intrusion etection sstem, in Proceeings of the ACM Conference on Computer an Communications Securit, ser. CCS. New York, NY, USA: ACM,, pp [6] Y. Zu, M. Yang, Z. Xu, L. Wang, X. Tian, K. Peng, an Q. Dong, Gpu-ase nfa implementation for memor efficient high spee regular expression matching, in Proceeings of the 7th ACM SIGPLAN Smposium on Principles an Practice of Parallel Programming, ser. PPoPP. New York, NY, USA: ACM,, pp. 9. [7] X. Yu an M. Becchi, Gpu acceleration of regular expression matching for large atasets, exploring the implementation space, in Proceeings of the ACM International Conference on Computing Frontiers, ser. CF 3. New York, NY, USA: ACM, 3, pp. 8: 8:. [8] M. Becchi, M. Franklin, an P. Crowle, A workloa for evaluating eep packet inspection architectures. in Proceeings of the 8 IEEE International Smposium on Workloa Characterization. IEEE, 8. [9] X. Liu, X. Liu, an N. Sun, Fast an compact regular expression matching using character sustitution, in Proceeings of ACM/IEEE Smposium on Architectures for Networking an Communications Sstems,. REFERENCES [] N. Cascarano, P. Rolano, F. Risso, an R. Sisto, infant, nfa pattern matching on gpgpu evices, SIGCOMM Comput. Commun. Rev., vol., pp. 6, Oct.. [] B. Broie, D. Talor, an R. Ctron, A scalale architecture for highthroughput regular-expression pattern matching, in ACM SIGARCH Computer Architecture News, vol. 3, no., Ma. 6, pp. 9. [3] M. Becchi an P. Crowle, An improve algorithm to accelerate regular expression evaluation, in Proceeings of the 3r ACM/IEEE Smposium on Architecture for networking an communications sstems. ACM, 7, pp [], A-fa, a time- an space-efficient fa compression algorithm for fast regular expression evaluation, ACM Trans. Archit. Coe Optim., vol., no., pp. : :6, 3. [5], Efficient regular expression evaluation, theor to practice, in Proceeings of the th ACM/IEEE Smposium on Architectures for Networking an Communications Sstems. ACM, 8, pp [6] N. Yamagaki, R. Sihu, an S. Kamia, High-spee regular expression matching engine using multi-character nfa, in International Conference on Fiel Programmale Logic an Applications, 8, pp [7] M. Avalle, F. Risso, an R. Sisto, Efficient multistriing of large noneterministic finite state automata for eep packet inspection, in IEEE International Conference on Communications,, pp [8] L. Yang, R. Karim, V. Ganapath, an R. Smith, Fast, memorefficient regular expression matching with nfa-os, Computer Networks, vol. 55, no. 5, pp ,. [9] S. Kong, R. Smith, an C. Estan, Efficient signature matching with multiple alphaet compression tales, Proceeings of the th international conference on Securit an privac in communication networks - SecureComm 8, p., 8. [] C. Meiners, J. Patel, E. Norige, E. Torng, an A. Liu, Fast regular expression matching using small tcams for network intrusion etection an prevention sstems, in Proceeings of the 9th USENIX conference on Securit. USENIX Association,, pp [] N. Hua, H. Song, an T. V. Lakshman, Variale-strie multi-pattern matching for scalale eep packet inspection, IEEE INFOCOM 9 - The 8th Conference on Computer Communications, pp. 5 3, 9. [] R. Smith, N. Goal, J. Ormont, K. Sankaralingam, an C. Estan, Evaluating GPUs for network packet signature matching, in Proceeings of ISPASS 9, 9, pp Matteo Avalle (matteo.avalle@polito.it) grauate in Computer Engineering in from Politecnico i Torino, Turin, Ital, an receive a Ph.D. egree in Computer Engineering from the same universit in. His research interests inclue formal methos applie to securit protocols an parallel/istriute computing applie to network packet processing an traffic analsis. He is now Chief Technical Officer at Fluentif LTD. Fulvio Risso (fulvio.risso@polito.it) grauate in Computer Engineering in 995 an receive the Ph.D. egree in Computer Engineering in, oth from Politecnico i Torino, Torino, Ital. He is currentl assistant professor with the Department of Control an Computer Engineering of Politecnico i Torino. His current research activities focus on efficient ata processing applie to traffic analsis, ata plane packet processing an software-efine networks. Riccaro Sisto (riccaro.sisto@polito.it) receive the M.S. egree in Electronic Engineering in 987, an the Ph.D. egree in Computer Engineering in 99, oth from Politecnico i Torino, Torino, Ital. Since 99 he has een working at Politecnico i Torino, where he is currentl Full Professor of Computer Engineering. His main research interests are in the area of formal methos applie to software an communication protocols, istriute sstems, an computer securit.

Online Appendix to: Generalizing Database Forensics

Online Appendix to: Generalizing Database Forensics Online Appenix to: Generalizing Database Forensics KYRIACOS E. PAVLOU an RICHARD T. SNODGRASS, University of Arizona This appenix presents a step-by-step iscussion of the forensic analysis protocol that