Regular Expression Matching with Multi-Strings and Intervals
Philip Bille, Mikkel Thorup

Outline
Definition
Applications
Previous work
Two new problems: multi-strings and character class intervals
Algorithms
  Thompson's algorithm with multi-strings.
  Decomposition-based algorithms with multi-strings.
  Character class interval extensions.

Regular Expressions
A character α is a regular expression.
If S and T are regular expressions, then so are:
  The union S|T
  The concatenation ST (S·T)
  The Kleene star S*

Languages
The language L(R) of a regular expression R is:
  L(α) = {α}
  L(S|T) = L(S) ∪ L(T)
  L(ST) = L(S)L(T)
  L(S*) = {ε} ∪ L(S) ∪ L(S)² ∪ L(S)³ ∪ ...

Example
R = (a*)(b|c)
L(R) = {b, c, ab, ac, aab, aac, ...}

Regular Expression Matching
Given a regular expression R and a string Q, the regular expression matching problem is to decide if Q ∈ L(R).

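As a quick illustration of the decision problem, the sketch below tests Q ∈ L(R) for the expression (a*)(b|c) with Python's built-in `re` module. This is only a stand-in: `re` is a backtracking engine, not one of the algorithms discussed in these slides, and the helper name `matches` is an assumption made for the example.

```python
import re

def matches(R: str, Q: str) -> bool:
    """Decide whether Q is in L(R): the whole string must match, not a substring."""
    return re.fullmatch(R, Q) is not None

R = r"(a*)(b|c)"
for Q in ["b", "aac", "", "ba"]:
    print(repr(Q), matches(R, Q))
```

Note the use of `fullmatch` rather than `search`: the problem as stated asks about membership of the entire string Q, not about occurrences inside it.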
Applications
Primitive in large-scale data processing:
  Internet traffic analysis
  Protein searching
  XML queries
Standard utilities and tools:
  Grep and Sed
  Perl

Previous Work (Worst-Case Efficient Algorithms)
Let |R| = m and |Q| = n, and let w be the number of bits in a machine word.
Standard textbook algorithm [Thompson 1968] simulates a non-deterministic automaton (NFA) in O(nm) time.
NFA-decomposition algorithms [Myers 1992], [B 2006], [B, Farach-Colton 2005], [B, T 2009]: Decompose the NFA into a tree of small NFAs and combine with tabulation and/or word-level parallelism to speed up Thompson's algorithm.
We will need the O(n(m log w/w + log m)) time algorithm [B 2006] for our results. It is the fastest known algorithm for large w.

Problem 1: Multi-Strings
Many regular expressions consist of k << m strings.
Example: Gnutella download stream detection:
  (Server:|User-Agent:)( |\t)*(limewire|BearShare|Gnucleus|Morpheus|XoloX|gtk-gnutella|Mutella|MyNapster|Qtella|AquaLime|NapShare|Comback|PHEX|SwapNut|FreeWire|Openext|Toadnode)
  k = 21 vs. m = 174.
Can we exploit k << m in algorithms for regular expression matching?

Problem 2: Character Class Intervals
For a subset of characters C, the character class interval C{x,y} represents a string of characters from C of length at least x and at most y.
Example: [fg]{13,42}
The special case of gaps (Σ{x,y}) is important in protein searching.
We can always convert a character class interval operator to standard operators, but this increases the length of the regular expression by y.
Can we efficiently implement character class interval operators in regular expression matching?

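The conversion to standard operators can be sketched as follows: C{x,y} becomes x mandatory copies of C followed by y - x copies of (C|ε), so the class appears y times in total. The function name `expand_interval` is hypothetical, and Python's `re` syntax `(?:C|)` is used here to write (C|ε); this is an illustration of the blow-up, not the technique proposed in the slides.

```python
import re

def expand_interval(char_class: str, x: int, y: int) -> str:
    """Rewrite C{x,y} using only concatenation and union with the empty string."""
    assert 0 <= x <= y
    required = char_class * x                   # x mandatory copies of C
    optional = f"(?:{char_class}|)" * (y - x)   # y - x copies of (C | empty)
    return required + optional

# Size blow-up for the example interval: compact form vs. expanded form.
print(len("[fg]{13,42}"), len(expand_interval("[fg]", 13, 42)))

# Equivalence check on a small interval only: a backtracking engine can take
# exponential time on the expanded form when the input is too long to match.
small = expand_interval("[fg]", 2, 5)
for n in range(8):
    q = "f" * n
    same = (re.fullmatch(small, q) is None) == (re.fullmatch("[fg]{2,5}", q) is None)
    print(n, same)
```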
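Thompson's Algorithm
[Figure: Thompson's NFA construction rules: (a) a single character α, (b) concatenation of N(S) and N(T), (c) union of N(S) and N(T) joined by ε-transitions, (d) Kleene star of N(S) with ε-transitions.]
Recursively construct a non-deterministic finite automaton (NFA) from R.
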
Thompson's Algorithm
[Figure: the Thompson NFA N(R) for an example regular expression R.]
The Thompson NFA (TNFA) N(R) has O(|R|) = O(m) states and transitions.
N(R) accepts L(R). Any path from the start state to the accept state corresponds to a string in L(R) and vice versa.
Traverse the TNFA on Q one character at a time.
O(m) per character => O(|Q|m) = O(nm) time algorithm.
Can we get O(nk)?

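The state-set traversal can be sketched in a few lines. This is a minimal illustration, not the slides' exact construction: the expression is given as a hand-written syntax tree (the tuple encoding and the names `build`, `closure` and `accepts` are assumptions), and concatenation links the two automata with an ε-transition instead of merging states. The automaton still has O(m) states, and each character costs O(m) work.

```python
from itertools import count

def build(ast, eps, step, ids):
    """Add N(ast) to the edge tables and return its (start, accept) states."""
    kind = ast[0]
    if kind == "chr":                      # single character
        s, t = next(ids), next(ids)
        step[s] = (ast[1], t)
        return s, t
    if kind == "cat":                      # concatenation ST
        s1, t1 = build(ast[1], eps, step, ids)
        s2, t2 = build(ast[2], eps, step, ids)
        eps.setdefault(t1, []).append(s2)
        return s1, t2
    if kind == "alt":                      # union S|T
        s, t = next(ids), next(ids)
        for sub in ast[1:]:
            a, b = build(sub, eps, step, ids)
            eps.setdefault(s, []).append(a)
            eps.setdefault(b, []).append(t)
        return s, t
    if kind == "star":                     # Kleene star S*
        s, t = next(ids), next(ids)
        a, b = build(ast[1], eps, step, ids)
        eps.setdefault(s, []).extend([a, t])
        eps.setdefault(b, []).extend([a, t])
        return s, t
    raise ValueError(kind)

def closure(states, eps):
    """All states reachable from `states` through epsilon-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def accepts(ast, Q):
    eps, step = {}, {}
    start, accept = build(ast, eps, step, count())
    active = closure({start}, eps)
    for ch in Q:                           # O(m) work per character
        moved = {step[u][1] for u in active if u in step and step[u][0] == ch}
        active = closure(moved, eps)
    return accept in active

# R = (a*)(b|c)
R_ast = ("cat", ("star", ("chr", "a")), ("alt", ("chr", "b"), ("chr", "c")))
print(accepts(R_ast, "aac"), accepts(R_ast, "ba"))
```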
Thompson's Algorithm with Multi-Strings
[Figure: example pruned TNFA with string transitions.]
Construct the pruned TNFA: Replace the strings L = {L1, ..., Lk} with single transitions => number of states and transitions is O(k).
Maintain a FIFO bit queue for each Li of length |Li|.
Preprocess L for fast multi-string matching (Aho-Corasick automaton).

Thompson's Algorithm with Multi-Strings
[Figure: example pruned TNFA with string transitions.]
Interleaved traversal of the TNFA and multi-string matching on one character from Q at a time:
  Start point of a string transition is active => enqueue 1, else enqueue 0.
  Front of the queue is 1 and the string matches => make the end point active.
O(k) states and transitions, k queues, multi-string matching is fast => O(k) time per character
=> Total time O(nk + m log k) and space O(m).

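The interleaved traversal can be sketched as below, under loud assumptions: the pruned automaton is written by hand for a shortened version of the Gnutella expression, the names `EDGES` and `match_multistring` are hypothetical, and a naive suffix test (`str.endswith`) stands in for the Aho-Corasick automaton, so this sketch does not achieve the stated time bound. What it does show is the role of the bit queues: each queue remembers whether the start point of its string transition was active |Li| characters ago.

```python
from collections import deque

def closure(states, eps):
    """All states reachable from `states` through epsilon-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def match_multistring(edges, eps, start, accept, Q):
    """edges: list of (u, string, v) string transitions of a pruned NFA."""
    # One FIFO bit queue per string, of the same length as the string.
    queues = [deque([0] * len(s), maxlen=len(s)) for _, s, _ in edges]
    active = closure({start}, eps)
    for j in range(1, len(Q) + 1):              # j = number of characters read
        reached = set()
        for (u, s, v), q in zip(edges, queues):
            q.append(1 if u in active else 0)   # was u active before this character?
            # q[0] now says whether u was active len(s) characters ago.
            # Naive suffix test; an Aho-Corasick automaton would report this match.
            if q[0] and Q.endswith(s, 0, j):
                reached.add(v)
        active = closure(reached, eps)
    return accept in active

# Hand-built pruned automaton for (Server:|User-Agent:)( |\t)*(limewire|Morpheus)
EDGES = [
    (0, "Server:", 1), (0, "User-Agent:", 1),
    (1, " ", 1), (1, "\t", 1),
    (1, "limewire", 2), (1, "Morpheus", 2),
]
print(match_multistring(EDGES, {}, 0, 2, "Server: \tlimewire"))
```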
Decomposition Algorithms
We use the NFA-decomposition algorithm based on word-level parallelism [B 2006]:
  Simplifying assumption: m ≥ w.
  Decompose the TNFA into a tree of O(m/w) micro TNFAs, each with at most w states.
  Encode each micro TNFA state-set in O(w) bits.
  Micro TNFA traversal on a single character in O(log w) time using word-level parallelism.
=> O((m/w) log w) time on a single character for the entire TNFA
=> O(nm log w/w) algorithm for regular expression matching. Fastest known for large w.

Decomposition Algorithms with Multi-Strings
Goal: Replace m with k. Process a character in O(k log w/w) time.
Apply the decomposition to the pruned TNFA:
  Tree of O(k/w) micro TNFAs with at most w states and w strings.
  Reuse the ε-transition traversal => O(log w) per micro TNFA.
  Reuse the multi-string matching algorithm.
The missing piece: How can we maintain w bit queues in O(log w) time per operation?

Case 1: Short Bit Queues (length ≤ 2w)
First, suppose all queues have the same length!
Represent the queues vertically.
In each step, insert the input bits at the back of the queues and output the front of the queues.
Implicitly move all bits forward by updating the pointer to the start of the queues.
=> O(1) time per step.

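For the equal-length case, the vertical representation can be sketched as a circular array of words, where bit i of every word belongs to queue i. The class name `VerticalBitQueues` is hypothetical, and Python integers stand in for w-bit machine words. One step reads the front word, overwrites it with the input word, and advances the pointer, so all queues move in O(1) word operations.

```python
class VerticalBitQueues:
    """Up to w FIFO bit queues of equal length, one word per queue position."""

    def __init__(self, length: int):
        self.words = [0] * length   # words[p] holds bit p of every queue
        self.front = 0              # index of the current front position

    def step(self, in_bits: int) -> int:
        """Dequeue the front bit of every queue and enqueue one new bit per queue."""
        out_bits = self.words[self.front]
        self.words[self.front] = in_bits          # the freed slot becomes the new back
        self.front = (self.front + 1) % len(self.words)
        return out_bits

queues = VerticalBitQueues(3)
outputs = [queues.step(bits) for bits in [0b01, 0b10, 0b11, 0, 0, 0]]
print(outputs)
```

In this sketch the bits inserted in one step come out again exactly `length` steps later; no bit is ever moved, only the pointer changes.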
Case 1: Different lengths?

Case 1: Short Bit Queues (length ≤ 2w)
With a bit mask and standard bitwise operations we can implement each jump point in O(1) time.
=> O(log w) time per step.

Case 2: Long Bit Queues (length > 2w)
Horizontal representation with vertical front and back buffers of length w.
Enqueue and dequeue from the buffers in O(1) time.
Every w steps (full buffers):
  Transpose the back buffer and insert it into the horizontal representation.
  Transpose the front w entries of the horizontal representation and insert them into the front buffer.
Transpose takes time O(w log w) [T 1997] => Amortized O(log w) time per step.

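The amortized bound follows from charging the two transposes to the w steps between them. Written out, assuming O(1) buffer work per step as stated above:

```latex
\[
\underbrace{2 \cdot O(w \log w)}_{\text{two transposes every } w \text{ steps}}
\;+\;
\underbrace{w \cdot O(1)}_{\text{buffer enqueues and dequeues}}
\;=\; O(w \log w)
\quad\Longrightarrow\quad
\frac{O(w \log w)}{w} \;=\; O(\log w) \text{ amortized per step.}
\]
```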
Algorithm Summary
O(log w) per character per micro TNFA => O(k log w/w) per character.
=> Total time O(n(k log w/w + log k) + m log k) and space O(m).

Character Class Intervals
New technique to maintain w counters in parallel with reset and decrement operations.
Combine with bit queues to support character class intervals.