Special Lecture on Information Knowledge Network
-Information retrieval and pattern matching-
The 5th: Regular expression matching
Takuya Kida
IKN Laboratory, Division of Computer Science and Information Technology
Special lecture on IKN, 2017/11/22

Today's contents
- About regular expressions
- Flow of matching processing
- Construction of a parse tree for an RE
- Construction of an NFA for RE matching
- How to simulate the NFA?

What is a regular expression?
A notation for flexible and powerful pattern matching.
Console command examples:
  rm *.txt                 matches any filename ending in .txt
  cp Important[0-9].doc    matches Important0.doc through Important9.doc
Grep search example:
  grep -E "for.+(256|CHR_SIZE)" *.c
Matching script example in Perl:
  m|^http://.+\.jp/.+$|    matches strings that start with http:// followed by .jp/
A regular expression can express a regular set (regular language), that is, a language (set of strings) $L$ that a finite automaton can accept.

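The three kinds of patterns on this slide can be tried out in Python as a rough sketch. The sample file names, the C source line, and the URL below are made-up inputs, and `fnmatch` is used as a stand-in for the shell's wildcard matching.

```python
import fnmatch
import re

# Shell wildcards such as `*.txt` are globs; fnmatch applies the same rules.
txt_ok = fnmatch.fnmatchcase("notes.txt", "*.txt")
doc_ok = fnmatch.fnmatchcase("Important7.doc", "Important[0-9].doc")

# "for", then one or more characters, then either 256 or CHR_SIZE.
loop_re = re.compile(r"for.+(256|CHR_SIZE)")

# The whole string must start with http:// and contain ".jp/" plus more text.
url_re = re.compile(r"^http://.+\.jp/.+$")

sample_line = "for (i = 0; i < CHR_SIZE; i++)"   # hypothetical C source line
sample_url = "http://www.example.jp/index.html"  # hypothetical URL

print(txt_ok, doc_ok)
print(bool(loop_re.search(sample_line)))
print(bool(url_re.match(sample_url)))
```

Note that the dot before `jp` is escaped here so that it matches a literal dot only.
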
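Definition of regular expression
A regular expression (RE) is a string over $\Sigma \cup \{\varepsilon, |, \cdot, *, (, )\}$ which is recursively defined by the following rules:
(1) $\varepsilon$ and any element of $\Sigma$ are REs
(2) If $\alpha$ and $\beta$ are REs, then $(\alpha \cdot \beta)$ is an RE
(3) If $\alpha$ and $\beta$ are REs, then $(\alpha | \beta)$ is an RE
(4) If $\alpha$ is an RE, then $\alpha^*$ is an RE
(5) Only those derived from the above are REs
Example: $(A \cdot ((A \cdot T) | (C \cdot G))^*)$, abbreviated as A(AT|CG)*
- The symbols $|$, $\cdot$, and $*$ are called operators
- The symbol $+$ is often used as $\alpha^+ = \alpha \cdot \alpha^*$ for an RE $\alpha$
- $\alpha \cdot \beta$ is abbreviated as $\alpha\beta$ for convenience
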
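Semantics of regular expression
An RE is mapped to a subset of $\Sigma^*$ (a language $L$):
(i) $\|\varepsilon\| = \{\varepsilon\}$
(ii) For any $a \in \Sigma$, $\|a\| = \{a\}$
(iii) For any REs $\alpha$ and $\beta$, $\|(\alpha \cdot \beta)\| = \|\alpha\| \cdot \|\beta\|$
(iv) For any REs $\alpha$ and $\beta$, $\|(\alpha | \beta)\| = \|\alpha\| \cup \|\beta\|$
(v) For any RE $\alpha$, $\|\alpha^*\| = \|\alpha\|^*$
For example:
$\|(a \cdot (a|b)^*)\| = \|a\| \cdot \|(a|b)^*\| = \{a\} \cdot \|(a|b)\|^* = \{a\} \cdot \{a, b\}^* = \{ax \mid x \in \{a, b\}^*\}$
[Figure: a DFA equivalent to this example, with states q0, q1, q2 and transitions labeled a, b, and a,b]
Exercise: how about (AT|GA)(TT)*?
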
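The semantic rules (i) to (v) can be checked on small cases with a sketch like the following. It assumes a hypothetical nested-tuple encoding of REs (`("eps",)`, `("sym", a)`, `("cat", l, r)`, `("alt", l, r)`, `("star", c)`) and, because the language of a starred RE is infinite, it only enumerates strings up to a given length.

```python
def language(node, max_len):
    """Return all strings of length <= max_len in the language of `node`."""
    kind = node[0]
    if kind == "eps":                      # rule (i)
        return {""}
    if kind == "sym":                      # rule (ii)
        return {node[1]} if max_len >= 1 else set()
    if kind == "cat":                      # rule (iii): product of languages
        left = language(node[1], max_len)
        right = language(node[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "alt":                      # rule (iv): union of languages
        return language(node[1], max_len) | language(node[2], max_len)
    if kind == "star":                     # rule (v): closure, cut off at max_len
        base = language(node[1], max_len) - {""}
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node kind: {kind}")


# a(a|b)*, the example on the slide
example = ("cat", ("sym", "a"), ("star", ("alt", ("sym", "a"), ("sym", "b"))))
print(sorted(language(example, 2)))  # ['a', 'aa', 'ab']
```

Every enumerated string starts with `a`, in line with $\{ax \mid x \in \{a, b\}^*\}$.
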
What is the RE matching problem?
The regular expression matching problem is the problem of finding, in a text, any strings in $L(\alpha) = \|\alpha\|$ for a given RE $\alpha$.
REs and finite automata have the same ability to define languages:
- We can construct an FA $M$ that accepts the language $L(\alpha)$ for an RE $\alpha$
- We can also describe an RE $\alpha$ that derives the language $L(M)$ for an FA $M$
  Refer to "Automaton and computability" (Sec. 2.5) by Setsuo Arikawa and Satoru Miyano
Create a DFA/NFA corresponding to a given RE and simulate its moves:
- It is easier to convert an RE into an NFA than into a DFA
- The pattern occurrences are found whenever the FA reaches one of its final states while reading the text

Flow of matching process
General flow:
  RE -> (parsing) -> parse tree -> (NFA construction by Thompson's method or Glushkov's method) -> NFA [-> DFA] -> (text scan) -> report the occurrences
Flow with filtering technique:
  RE -> (extraction) -> set of factors -> (multiple pattern matching) -> find candidates -> (verify) -> report the occurrences

Construction of parse tree
Parse tree: a tree structure used in preparation for making an NFA.
- Each leaf is labeled by a symbol $a \in \Sigma$ or the empty word $\varepsilon$
- Each internal node is labeled by an operator $x \in \{|, \cdot, *\}$
Ex) Parse tree $T_{RE}$ for RE = (AT|GA)((AG|AAA)*)
[Figure: parse tree of the RE, with operator nodes arranged by depth]

Pseudo code

Parse(p = p_1 p_2 … p_m, last)
  v ← θ
  while p_last ≠ $ do
    if p_last ∈ Σ or p_last = ε then      /* normal character */
      v_r ← create a node with p_last
      if v ≠ θ then v ← [·](v, v_r)
      else v ← v_r
      last ← last + 1
    else if p_last = '|' then              /* union operator */
      (v_r, last) ← Parse(p, last + 1)
      v ← [|](v, v_r)
    else if p_last = '*' then              /* star operator */
      v ← [*](v)
      last ← last + 1
    else if p_last = '(' then              /* open parenthesis */
      (v_r, last) ← Parse(p, last + 1)
      last ← last + 1
      if v ≠ θ then v ← [·](v, v_r)
      else v ← v_r
    else if p_last = ')' then              /* close parenthesis */
      return (v, last)
    end of if
  end of while
  return (v, last)

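For experimenting with parse trees, here is a minimal Python sketch of a recursive-descent parser. It is not a transcription of the pseudo code above: it uses the usual precedence (star binds tighter than concatenation, which binds tighter than union), whereas the pseudo code applies `*` to the whole tree built so far and therefore relies on explicit parentheses. The nested-tuple node format is an assumption made for this example.

```python
def parse(pattern):
    """Parse `pattern` into a nested-tuple parse tree."""
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def union():                       # union := concat ('|' concat)*
        nonlocal pos
        node = concat()
        while peek() == "|":
            pos += 1
            node = ("alt", node, concat())
        return node

    def concat():                      # concat := starred*
        node = None
        while peek() not in (None, "|", ")"):
            item = starred()
            node = item if node is None else ("cat", node, item)
        return node if node is not None else ("eps",)

    def starred():                     # starred := (symbol | '(' union ')') '*'*
        nonlocal pos
        ch = peek()
        pos += 1
        if ch == "(":
            node = union()
            if peek() != ")":
                raise ValueError("missing ')'")
            pos += 1
        elif ch == "*":
            raise ValueError("nothing to repeat")
        else:
            node = ("sym", ch)
        while peek() == "*":
            pos += 1
            node = ("star", node)
        return node

    tree = union()
    if pos != len(pattern):
        raise ValueError("unexpected ')'")
    return tree


print(parse("AT|GA"))
```

With this precedence, `parse("AB*")` stars only `B`, while the parenthesized example `(AT|GA)((AG|AAA)*)` yields the same tree shape under either scheme.
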
Thompson's NFA construction method
K. Thompson. Regular expression search algorithm. Communications of the ACM, 11:419-422, 1968.
Idea:
- Construct an NFA $Th(v)$ that accepts the language $L(RE_v)$ corresponding to the subtree rooted at $v$, while traversing the parse tree $T_{RE}$ in post-order
- Each $Th(v)$ is obtained by connecting the automata for the children of $v$ with ε-transitions
Properties of the Thompson NFA:
- #states $< 2m$ and #transitions $< 4m$, i.e., $O(m)$
- Contains many ε-transitions
- Every transition other than an ε-transition goes from state $i$ to state $i + 1$
Ex) Thompson NFA for RE = (AT|GA)((AG|AAA)*)
[Figure: Thompson NFA with states 0 to 17]

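NFA construction algorithm
Traversing the parse tree $T_{RE}$ in post-order, construct an NFA $Th(v)$ for each node $v$ as follows:
(i) When $v$ is ε
(ii) When $v$ is a symbol $a \in \Sigma$
(iii) When $v$ is the operator $\cdot(v_L, v_R)$
(iv) When $v$ is the operator $|(v_L, v_R)$
(v) When $v$ is the operator $*(v_C)$
[Figures: (i) I to F by an ε-transition; (ii) I to F by a transition labeled a; (iii) $Th(v_L)$ followed by $Th(v_R)$, from $I_L$ to $F_R$; (iv) a new initial state I branching to $I_L$ and $I_R$, with $F_L$ and $F_R$ joining in a new final state F; (v) new states I and F around $Th(v_C)$]
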
Behavior of the NFA construction algorithm
Ex) Parse tree $T_{RE}$ for RE = (AT|GA)((AG|AAA)*)
Ex) Thompson NFA for RE = (AT|GA)((AG|AAA)*)
[Figure: the parse tree and the resulting Thompson NFA with states 0 to 17, built step by step]

Pseudo code

Thompson_recur(v)
  if v = [|](v_L, v_R) or v = [·](v_L, v_R) then
    Th(v_L) ← Thompson_recur(v_L)
    Th(v_R) ← Thompson_recur(v_R)
  else if v = [*](v_C) then Th(v_C) ← Thompson_recur(v_C)
  /* recursive post-order traversal so far */
  if v = (ε) then return construction (i)
  if v = (α), α ∈ Σ then return construction (ii)
  if v = [·](v_L, v_R) then return construction (iii)
  if v = [|](v_L, v_R) then return construction (iv)
  if v = [*](v_C) then return construction (v)

Thompson(RE)
  v_RE ← Parse(RE$, 1)               /* construct parse tree */
  Th(v_RE) ← Thompson_recur(v_RE)

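The recursive construction can be sketched in Python as follows, together with a direct simulation that keeps the set of active states. This is a variant under stated assumptions, not the exact construction of the slides: the parse tree is a hypothetical nested tuple, and concatenation glues the two sub-automata with an extra ε-transition instead of merging states, so the state count can exceed the bound given earlier.

```python
def thompson(tree):
    """Return (initial, final, eps, delta) for a nested-tuple parse tree."""
    eps = []      # eps[s]: states reachable from s by one epsilon-transition
    delta = {}    # (s, symbol) -> the single target state

    def new_state():
        eps.append(set())
        return len(eps) - 1

    def build(node):
        kind = node[0]
        if kind in ("eps", "sym"):            # constructions (i) and (ii)
            i, f = new_state(), new_state()
            if kind == "eps":
                eps[i].add(f)
            else:
                delta[(i, node[1])] = f
            return i, f
        if kind == "cat":                     # concatenation, glued by epsilon
            li, lf = build(node[1])
            ri, rf = build(node[2])
            eps[lf].add(ri)
            return li, rf
        if kind == "alt":                     # union
            i = new_state()
            li, lf = build(node[1])
            ri, rf = build(node[2])
            f = new_state()
            eps[i] |= {li, ri}
            eps[lf].add(f)
            eps[rf].add(f)
            return i, f
        if kind == "star":                    # star
            i = new_state()
            ci, cf = build(node[1])
            f = new_state()
            eps[i] |= {ci, f}
            eps[cf] |= {ci, f}
            return i, f
        raise ValueError(f"unknown node kind: {kind}")

    initial, final = build(tree)
    return initial, final, eps, delta


def closure(states, eps):
    """Epsilon-closure of a set of states."""
    seen, stack = set(states), list(states)
    while stack:
        for t in eps[stack.pop()]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def search(tree, text):
    """Return the 1-based end positions of all occurrences in `text`."""
    initial, final, eps, delta = thompson(tree)
    active = closure({initial}, eps)
    ends = []
    for pos, ch in enumerate(text, 1):
        moved = {delta[(s, ch)] for s in active if (s, ch) in delta}
        # Re-activate the initial state so a match may start anywhere.
        active = closure(moved | {initial}, eps)
        if final in active:
            ends.append(pos)
    return ends


A, T, G = ("sym", "A"), ("sym", "T"), ("sym", "G")
# (AT|GA)((AG|AAA)*)
EXAMPLE_TREE = ("cat",
                ("alt", ("cat", A, T), ("cat", G, A)),
                ("star", ("alt", ("cat", A, G), ("cat", ("cat", A, A), A))))
print(search(EXAMPLE_TREE, "CGAAGT"))  # [3, 5]
```

In the sample text, `GA` ends at position 3 and `GAAG` ends at position 5.
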
Glushkov's NFA construction method
V-M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16:1-53, 1961.
Idea:
- Make a new expression $\overline{RE}$ by numbering each symbol $a \in \Sigma$ of RE in order from left to right (let $\overline{\Sigma}$ be the alphabet with subscripts)
  Ex) RE = (AT|GA)((AG|AAA)*) becomes $\overline{RE} = (A_1T_2|G_3A_4)((A_5G_6|A_7A_8A_9)^*)$
- Create an NFA that accepts $L(\overline{RE})$, then convert it into the final NFA by removing the subscripts of the symbols
Properties of the Glushkov NFA:
- #states is exactly $m + 1$, but #transitions is $O(m^2)$
- There are no ε-transitions
- For any state $v$, all transitions entering $v$ carry the same label
Ex) NFA for $\overline{RE} = (A_1T_2|G_3A_4)((A_5G_6|A_7A_8A_9)^*)$
Ex) Glushkov NFA
[Figure: the NFA for $\overline{RE}$ and the final Glushkov NFA, both with states 0 to 9]

NFA construction algorithm (1)
Let $\overline{RE}$ be the numbered expression for RE, $Pos(\overline{RE}) = \{1, \ldots, m\}$, and $\overline{\Sigma}$ the alphabet with subscripts.
Traversing the parse tree $T_{RE}$ in post-order, for each language $RE_v$ corresponding to the subtree rooted at $v$, calculate the sets $First(RE_v)$ and $Last(RE_v)$ and the functions $Empty_v$ and $Follow(RE, x)$, defined as follows:
$First(\overline{RE}) = \{x \in Pos(\overline{RE}) \mid \exists u \in \overline{\Sigma}^*, \alpha_x u \in L(\overline{RE})\}$
$Last(\overline{RE}) = \{x \in Pos(\overline{RE}) \mid \exists u \in \overline{\Sigma}^*, u \alpha_x \in L(\overline{RE})\}$
$Follow(\overline{RE}, x) = \{y \in Pos(\overline{RE}) \mid \exists u, v \in \overline{\Sigma}^*, u \alpha_x \alpha_y v \in L(\overline{RE})\}$
$Empty_{RE}$ returns $\{\varepsilon\}$ if $\varepsilon \in L(RE)$, or $\emptyset$ otherwise. This can be calculated recursively as follows:
$Empty_{\varepsilon} = \{\varepsilon\}$, $Empty_{a \in \Sigma} = \emptyset$, $Empty_{RE_1 | RE_2} = Empty_{RE_1} \cup Empty_{RE_2}$, $Empty_{RE_1 \cdot RE_2} = Empty_{RE_1} \cap Empty_{RE_2}$, $Empty_{RE^*} = \{\varepsilon\}$.
The NFA is constructed from the values obtained above:
- First: transitions from the initial state of the NFA
- Last: final states of the NFA
- Follow: transition function
- Empty: is the initial state of the NFA also a final state?

NFA construction algorithm (2)
Glushkov NFA $GL = (S, \overline{\Sigma}, I, F, \overline{\delta})$ that accepts the language $L(\overline{RE})$:
- $S$: set of states, $S = \{0, 1, \ldots, m\}$
- $\overline{\Sigma}$: the alphabet with subscripts
- $I$: the initial state id, i.e., $I = 0$
- $F$: set of final states, $F = Last(\overline{RE}) \cup (Empty_{RE} \cdot \{0\})$
- $\overline{\delta}$: transition function defined as follows:
  $\forall x \in Pos(\overline{RE}), \forall y \in Follow(\overline{RE}, x): \overline{\delta}(x, \alpha_y) = y$
  Transitions from the initial state are as follows:
  $\forall y \in First(\overline{RE}): \overline{\delta}(0, \alpha_y) = y$
Ex) NFA for $\overline{RE} = (A_1T_2|G_3A_4)((A_5G_6|A_7A_8A_9)^*)$
[Figure: the NFA with states 0 to 9]

Pseudo code

Glushkov_variables(v, lpos)
  if v = [|](v_l, v_r) or v = [·](v_l, v_r) then
    lpos ← Glushkov_variables(v_l, lpos)
    lpos ← Glushkov_variables(v_r, lpos)
  else if v = [*](v_*) then lpos ← Glushkov_variables(v_*, lpos)
  end of if
  if v = (ε) then
    First(v) ← ∅, Last(v) ← ∅, Empty_v ← {ε}
  else if v = (a), a ∈ Σ then
    lpos ← lpos + 1
    First(v) ← {lpos}, Last(v) ← {lpos}, Empty_v ← ∅, Follow(lpos) ← ∅
  else if v = [|](v_l, v_r) then
    First(v) ← First(v_l) ∪ First(v_r)
    Last(v) ← Last(v_l) ∪ Last(v_r)
    Empty_v ← Empty_vl ∪ Empty_vr
  else if v = [·](v_l, v_r) then
    First(v) ← First(v_l) ∪ (Empty_vl · First(v_r))
    Last(v) ← (Empty_vr · Last(v_l)) ∪ Last(v_r)
    Empty_v ← Empty_vl ∩ Empty_vr
    for x ∈ Last(v_l) do Follow(x) ← Follow(x) ∪ First(v_r)
  else if v = [*](v_*) then
    First(v) ← First(v_*), Last(v) ← Last(v_*), Empty_v ← {ε}
    for x ∈ Last(v_*) do Follow(x) ← Follow(x) ∪ First(v_*)
  end of if
  return lpos

Each Follow update loop takes $O(m^2)$ time, so the whole computation takes $O(m^3)$ time in total.

Pseudo code (cont.)

Glushkov(RE)
  /* make a parse tree by parsing RE */
  v_RE ← Parse(RE$, 1)

  /* calculate each variable by using the parse tree */
  m ← Glushkov_variables(v_RE, 0)

  /* construct NFA GL(S, Σ, I, F, δ) from the variables */
  Δ ← ∅
  for i ∈ 0…m do create state i
  for x ∈ First(v_RE) do Δ ← Δ ∪ {(0, α_x, x)}
  for i ∈ 1…m do
    for x ∈ Follow(i) do Δ ← Δ ∪ {(i, α_x, x)}
  end of for
  for x ∈ Last(v_RE) ∪ (Empty_vRE · {0}) do mark x as terminal

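The First/Last/Follow/Empty computation and the resulting ε-free NFA can be sketched in Python as below. The nested-tuple parse tree, the boolean `nullable` in place of the set-valued Empty, and the simple set-based text scan are assumptions made for this example.

```python
def glushkov(tree):
    """Return (first, last, follow, nullable, symbols) for a parse tree."""
    follow = {}    # position -> set of positions that may come right after it
    symbols = {}   # position -> the symbol at that position

    def visit(node):
        kind = node[0]
        if kind == "eps":
            return set(), set(), True
        if kind == "sym":
            pos = len(symbols) + 1            # number symbols left to right
            symbols[pos] = node[1]
            follow[pos] = set()
            return {pos}, {pos}, False
        if kind == "star":
            first, last, _ = visit(node[1])
            for x in last:
                follow[x] |= first
            return first, last, True
        f1, l1, e1 = visit(node[1])
        f2, l2, e2 = visit(node[2])
        if kind == "alt":
            return f1 | f2, l1 | l2, e1 or e2
        if kind == "cat":
            for x in l1:
                follow[x] |= f2
            first = f1 | f2 if e1 else f1
            last = l1 | l2 if e2 else l2
            return first, last, e1 and e2
        raise ValueError(f"unknown node kind: {kind}")

    first, last, nullable = visit(tree)
    return first, last, follow, nullable, symbols


def search(tree, text):
    """Return the 1-based end positions of all occurrences in `text`."""
    first, last, follow, nullable, symbols = glushkov(tree)
    active = {0}                              # state 0 is the initial state
    ends = []
    for pos, ch in enumerate(text, 1):
        reached = set()
        for s in active:
            for y in (first if s == 0 else follow[s]):
                if symbols[y] == ch:          # all arrows into y carry symbols[y]
                    reached.add(y)
        active = reached | {0}                # a match may start anywhere
        if active & last or nullable:
            ends.append(pos)
    return ends


A, T, G = ("sym", "A"), ("sym", "T"), ("sym", "G")
# (AT|GA)((AG|AAA)*)
EXAMPLE_TREE = ("cat",
                ("alt", ("cat", A, T), ("cat", G, A)),
                ("star", ("alt", ("cat", A, G), ("cat", ("cat", A, A), A))))
first, last, follow, nullable, symbols = glushkov(EXAMPLE_TREE)
print(sorted(first), sorted(last), sorted(follow[2]))  # [1, 3] [2, 4, 6, 9] [5, 7]
```

For the running example this gives nine positions, so the NFA has ten states, and positions 2, 4, 6 and 9 all lead to positions 5 and 7.
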
Flow of matching process (reprint)
General flow:
  RE -> (parsing) -> parse tree -> (NFA construction by Thompson's method or Glushkov's method) -> NFA [-> DFA] -> (text scan) -> report the occurrences
- The NFA is simulated in $O(mn)$ time
- $O(2^m)$ time and space is needed for translating the NFA into a DFA
- There exists a method of converting an RE directly into a DFA
  Refer to Sec. 3.9 of Compilers: Principles, Techniques, and Tools by A. V. Aho, R. Sethi, and J. D. Ullman. Addison-Wesley, 1986. (Japanese translation: コンパイラ 原理・技法・ツール)

Methods of simulating an NFA
Simulating a Thompson NFA directly
- The most naive method
- Store the current active states in a list of size $O(m)$ and update them in $O(m)$ time per text character
- It therefore takes $O(mn)$ time
Simulating a Thompson NFA by converting it into an equivalent DFA
- Based on the classical conversion technique
- It takes $O(2^m)$ time and space for preprocessing
- There is a method that dynamically converts only the necessary parts of the DFA during the text scan
  A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
Efficient hybrid technique
- Divide the Thompson NFA into modules consisting of $O(k)$ nodes, and convert each module into a DFA
- The transitions between modules are simulated in an NFA manner
  E. W. Myers. A four Russians algorithm for regular expression pattern matching. Journal of the ACM, 39(2):430-448, 1992.
High-speed NFA simulation by bit-parallel techniques
- Simulating a Thompson NFA: S. Wu and U. Manber [1992]
- Simulating a Glushkov NFA: G. Navarro and M. Raffinot [1999]

Bit-parallel Thompson
S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.
Simulating a Thompson NFA by a bit-parallel technique:
- In a Thompson NFA, the state following the $i$-th state is always the $(i+1)$-th, except for ε-transitions, so a bit-parallel technique similar to the Shift-And method is applicable
- ε-transitions are simulated separately; a mask table of size $2^L$ is needed ($L$ is the number of states of the NFA)
- It takes $O(2^L + m|\Sigma|)$ time for preprocessing
- It scans the text in $O(n)$ time when $L$ is small enough
Mask tables for the Thompson NFA $N = (Q = \{s_0, \ldots, s_{|Q|-1}\}, \Sigma, I = s_0, F, \Delta)$:
$Q_n = \{0, \ldots, |Q|-1\}$, $I_n = 0^{|Q|-1}1$, and $F_n = |_{s_j \in F}\; 0^{|Q|-1-j}10^j$,
$B_n[i, \sigma] = |_{(s_i, \sigma, s_j) \in \Delta}\; 0^{|Q|-1-j}10^j$,
$E_n[i] = |_{s_j \in E(i)}\; 0^{|Q|-1-j}10^j$ (where $E(i)$ is the ε-closure of $s_i$),
$E_d[D] = |_{i,\; i = 0 \text{ OR } D \,\&\, 0^{L-i-1}10^i \neq 0^L}\; E_n[i]$,
$B[\sigma] = |_{i \in 0 \ldots L-1}\; B_n[i, \sigma]$

Pseudo code

BuildEps(N = (Q_n, Σ, I_n, F_n, B_n, E_n))
  for σ ∈ Σ do
    B[σ] ← 0^L
    for i ∈ 0…L-1 do B[σ] ← B[σ] | B_n[i, σ]
  end of for
  E_d[0] ← E_n[0]
  for i ∈ 0…L-1 do
    for j ∈ 0…2^i - 1 do
      E_d[2^i + j] ← E_n[i] | E_d[j]
    end of for
  end of for
  return (B, E_d)

BPThompson(N = (Q_n, Σ, I_n, F_n, B_n, E_n), T = t_1 t_2 … t_n)
  Preprocessing:
    (B, E_d) ← BuildEps(N)
  Searching:
    D ← E_d[I_n]                      /* initial state */
    for pos ∈ 1…n do
      if D & F_n ≠ 0^L then report an occurrence ending at pos - 1
      D ← E_d[(D << 1) & B[t_pos]]
    end of for

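A small Python sketch of the same table-driven scan is given below, using Python integers as bit masks (bit $i$ stands for state $i$). The NFA is a hand-numbered Thompson-style automaton for (AT|GA) chosen for this example, its edge-list representation is an assumption, and occurrences are reported right after each update rather than at the start of the next iteration.

```python
def build_tables(num_states, char_edges, eps_edges):
    """Build B (per-symbol target mask) and Ed (epsilon-closure table)."""
    # closure[i]: bit mask of the epsilon-closure of state i (including i)
    closure = [1 << i for i in range(num_states)]
    changed = True
    while changed:                              # propagate until a fixed point
        changed = False
        for src, dst in eps_edges:
            merged = closure[src] | closure[dst]
            if merged != closure[src]:
                closure[src] = merged
                changed = True
    B = {}
    for i, ch in char_edges:                    # edge i -ch-> i + 1
        B[ch] = B.get(ch, 0) | (1 << (i + 1))
    Ed = [0] * (1 << num_states)
    Ed[0] = closure[0]                          # state 0 is always kept active
    for i in range(num_states):
        for j in range(1 << i):
            Ed[(1 << i) + j] = closure[i] | Ed[j]
    return B, Ed


def bp_thompson_search(num_states, char_edges, eps_edges, final, text):
    """Return the 1-based end positions of all occurrences in `text`."""
    B, Ed = build_tables(num_states, char_edges, eps_edges)
    full = (1 << num_states) - 1
    final_mask = 1 << final
    D = Ed[1]                                   # closure of the initial state
    ends = []
    for pos, ch in enumerate(text, 1):
        D = Ed[((D << 1) & full) & B.get(ch, 0)]
        if D & final_mask:
            ends.append(pos)
    return ends


# Thompson-style NFA for (AT|GA) with 8 states:
#   0 -eps-> 1 -A-> 2 -T-> 3 -eps-> 7,   0 -eps-> 4 -G-> 5 -A-> 6 -eps-> 7
CHAR_EDGES = [(1, "A"), (2, "T"), (4, "G"), (5, "A")]
EPS_EDGES = [(0, 1), (0, 4), (3, 7), (6, 7)]

print(bp_thompson_search(8, CHAR_EDGES, EPS_EDGES, 7, "GATTACA"))  # [2, 3]
```

The shift works only because every labeled edge goes from state $i$ to state $i + 1$; the table `Ed` has $2^L$ entries, which is why the method suits small $L$.
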
Summary
REs and finite automata have the same ability to define languages.
Flow of regular expression matching
- Construct an NFA via a parse tree for the RE, then simulate the NFA to scan a text
- Filtering + multiple pattern matching + verification by NFA simulation
How to construct an NFA
- Thompson NFA: #states $< 2m$, #transitions $< 4m$, i.e., $O(m)$ space. Contains many ε-transitions. Every transition other than an ε-transition goes from state $i$ to state $i + 1$
- Glushkov NFA: #states is exactly $m + 1$, but #transitions is $O(m^2)$. There are no ε-transitions. For any state $v$, all transitions entering $v$ carry the same label
How to simulate an NFA
- Simulating a Thompson NFA directly takes $O(mn)$ time
- Converting into a DFA allows scanning in $O(n)$ time, but takes $O(2^m)$ time and space for preprocessing
- Speeding up by bit-parallel techniques: Bit-parallel Thompson, Bit-parallel Glushkov
The next theme: Compressed Pattern Matching

Appendix
About the definitions of terms which I didn't explain in the first lecture
- A subset of $\Sigma^*$ is called a formal language, or a language for short
- For languages $L_1, L_2 \subseteq \Sigma^*$, the set $\{xy \mid x \in L_1 \text{ and } y \in L_2\}$ is called the product of $L_1$ and $L_2$ and is denoted by $L_1 \cdot L_2$, or simply $L_1 L_2$
- For a language $L \subseteq \Sigma^*$, we define $L^0 = \{\varepsilon\}$ and $L^n = L^{n-1} \cdot L$ $(n \geq 1)$. Moreover, we define $L^* = \bigcup_{n=0}^{\infty} L^n$ and call it the closure of $L$. We also write $L^+ = \bigcup_{n=1}^{\infty} L^n$
About back-reference notations
- Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, The MIT Press, Elsevier, 1990. (Japanese translation: コンピュータ基礎理論ハンドブック Ⅰ: アルゴリズムと複雑さ, Maruzen, 1994.) Chapter 5, Sec. 2.3 and Sec. 6.1
- According to this, it seems that the notion of back references had already appeared in 1964
- It exceeds the framework of context-free grammars (and of course goes beyond REs)!
- Its matching problem is proved to be NP-complete!

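The product, power, and closure of languages defined in this appendix can be tried out on finite sets with a few lines of Python. Since $L^*$ is infinite in general, the sketch only takes the union of $L^0$ through $L^{n_{max}}$; the function names are chosen for this example.

```python
def product(l1, l2):
    """L1 . L2 = { xy | x in L1 and y in L2 }"""
    return {x + y for x in l1 for y in l2}


def power(lang, n):
    """L^0 = { "" } and L^n = L^(n-1) . L for n >= 1"""
    result = {""}
    for _ in range(n):
        result = product(result, lang)
    return result


def closure_up_to(lang, n_max):
    """Finite approximation of L*: the union of L^0, L^1, ..., L^n_max."""
    out = set()
    for n in range(n_max + 1):
        out |= power(lang, n)
    return out


print(sorted(product({"a", "b"}, {"c"})))    # ['ac', 'bc']
print(sorted(power({"a", "b"}, 2)))          # ['aa', 'ab', 'ba', 'bb']
print(sorted(closure_up_to({"ab"}, 2)))      # ['', 'ab', 'abab']
```
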