Special lecture on IKN (Information Knowledge Network) -Information retrieval and pattern matching-

Special lecture on Information Knowledge Network -Information retrieval and pattern matching-
The 5th: Regular expression matching
Takuya Kida, IKN Laboratory, Division of Computer Science and Information Technology
Special lecture on IKN, 2017/11/22

Today's contents
- About regular expressions
- Flow of the matching process
- Construction of a parse tree for an RE
- Construction of an NFA for RE matching
- How to simulate the NFA?

What is a regular expression?
A notation for flexible and powerful pattern matching.
- Console command examples: rm *.txt (matches any filename ending in .txt); cp Important[0-9].doc (matches Important0.doc to Important9.doc)
- Grep search example: grep -E "for.+(256|CHR_SIZE)" *.c
- Matching script example in Perl: m|^http://.+\.jp/.+$| (matches strings that start with http:// followed by .jp/)
A regular expression can express a regular set (regular language), that is, a language (set of strings) L that a finite automaton can accept.
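
As a concrete aside (not in the original slide), the same kinds of patterns can be tried with Python's re module; the two patterns below mirror the grep and Perl examples above.

import re

# Mirrors the grep example: "for", then one or more characters, then 256 or CHR_SIZE
c_line = "for (i = 0; i < 256; i++)"
print(bool(re.search(r"for.+(256|CHR_SIZE)", c_line)))    # True

# Mirrors the Perl example: strings that start with http:// and contain .jp/
url = "http://www.example.jp/index.html"
print(bool(re.match(r"^http://.+\.jp/.+$", url)))          # True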

Definition of regular expressions
A regular expression (RE) is a string over Σ ∪ {ε, ·, |, *, (, )} which is recursively defined by the following rules:
(1) ε and any element of Σ are REs
(2) If α and β are REs, then (α|β) is an RE
(3) If α and β are REs, then (α·β) is an RE
(4) If α is an RE, then α* is an RE
(5) Only those derived from the above are REs
Example: (A·((A·T)|(C·G))*), abbreviated as A(AT|CG)*
The symbols ·, | and * are called operators. The symbol + is often used, where α+ = α·α* for an RE α. α·β is abbreviated as αβ for convenience.
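
The recursive definition maps naturally onto a nested data structure. As an illustration of my own (not part of the lecture), REs can be represented as tuples in Python, one constructor per rule; the later sketches in this transcript reuse this representation.

# Tuple-based RE representation (illustrative; the tags are my own choice):
#   ('sym', a)           rule (1): a ∈ Σ (or ε)
#   ('union', α, β)      rule (2): (α|β)
#   ('concat', α, β)     rule (3): (α·β)
#   ('star', α)          rule (4): α*

# The example from the slide, A(AT|CG)*, built bottom-up:
example = ('concat', ('sym', 'A'),
           ('star', ('union',
                     ('concat', ('sym', 'A'), ('sym', 'T')),
                     ('concat', ('sym', 'C'), ('sym', 'G')))))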

Semantics of regular expressions
An RE is mapped to a subset of Σ* (a language L):
(i) L(ε) = {ε}
(ii) For any a ∈ Σ, L(a) = {a}
(iii) For any REs α and β, L(α|β) = L(α) ∪ L(β)
(iv) For any REs α and β, L(α·β) = L(α)·L(β)
(v) For any RE α, L(α*) = L(α)*
For example: L(a·(a|b)*) = L(a)·L((a|b)*) = {a}·L(a|b)* = {a}·{a,b}* = {ax | x ∈ {a,b}*}
(Figure: a DFA equivalent to this example, with states q0, q1, q2 and transitions on a and b.)
Exercise: how about (AT|G)(TT)*?
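
To make the denotational rules concrete, here is a small self-contained check of my own (not from the lecture) that enumerates the worked example L(a·(a|b)*), truncated to strings of length at most 4 so that the sets stay finite.

# Bounded versions of rules (iv) and (v): product and star of finite languages.
def concat(L1, L2, maxlen=4):
    return {x + y for x in L1 for y in L2 if len(x + y) <= maxlen}

def star(L, maxlen=4):
    result = {""}                          # L^0 = {ε}
    frontier = {""}
    while True:
        frontier = concat(frontier, L, maxlen) - result
        if not frontier:
            return result
        result |= frontier

La, Lb = {"a"}, {"b"}                      # rule (ii): L(a) = {a}, L(b) = {b}
print(sorted(concat(La, star(La | Lb))))   # all strings of length <= 4 starting with 'a'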

What is the RE matching problem?
The regular expression matching problem is the problem of finding, in a text, any strings in L(α) for a given RE α.
REs and finite automata have the same ability to define languages:
- We can construct an FA M that accepts the language L(α) for any RE α.
- We can also describe an RE α that derives the language L(M) for any FA M.
- Refer to "Automaton and Computability" (Sec. 2.5) by Setsuo Arikawa and Satoru Miyano.
Approach: create a DFA/NFA corresponding to the given RE and simulate its moves.
- It is easier to convert an RE into an NFA than into a DFA.
- The pattern occurrences are found whenever the FA reaches one of its final states while reading the text.

Flow of the matching process
General flow: RE → (parsing) → parse tree → (NFA construction by the Thompson method or the Glushkov method) → NFA (or DFA) → (text scan) → report the occurrences.
Flow with a filtering technique: RE → (extraction of a set of factors) → (multiple pattern matching to find candidates) → (verification) → report the occurrences.

Construction of the parse tree
Parse tree: a tree structure used in preparation for making the NFA.
Each leaf is labeled by a symbol a ∈ Σ or the empty word ε. Each internal node is labeled by x ∈ {·, |, *}.
Ex) Parse tree T_RE for RE = (AT|GA)((AG|AAA)*). (Figure: the parse tree, with the concatenation operator at the root, the subtree for (AT|GA) on the left, and the subtree for ((AG|AAA)*) on the right; the depth of each operator is annotated.)

Pseudo code
Parse(p = p1 p2 ... pm, last)
  v ← θ
  while p_last ≠ $ do
    if p_last ∈ Σ or p_last = ε then            /* normal character */
      v_r ← create a node with p_last
      if v ≠ θ then v ← [·](v, v_r) else v ← v_r
      last ← last + 1
    else if p_last = | then                      /* union operator */
      (v_r, last) ← Parse(p, last + 1)
      v ← [|](v, v_r)
    else if p_last = * then                      /* star operator */
      v ← [*](v)
      last ← last + 1
    else if p_last = ( then                      /* open parenthesis */
      (v_r, last) ← Parse(p, last + 1)
      last ← last + 1
      if v ≠ θ then v ← [·](v, v_r) else v ← v_r
    else if p_last = ) then                      /* close parenthesis */
      return (v, last)
    end of if
  end of while
  return (v, last)
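
Under the tuple representation introduced earlier, the routine above translates almost line by line into Python. This is a sketch of my own (the function and tag names are not from the lecture); as in the pseudocode, the input is terminated by '$'.

def parse(p, last=0):
    v = None
    while p[last] != '$':
        c = p[last]
        if c == '|':                        # union operator
            vr, last = parse(p, last + 1)
            v = ('union', v, vr)
        elif c == '*':                      # star operator
            v = ('star', v)
            last += 1
        elif c == '(':                      # open parenthesis
            vr, last = parse(p, last + 1)
            last += 1                       # skip the closing ')'
            v = vr if v is None else ('concat', v, vr)
        elif c == ')':                      # close parenthesis
            return v, last
        else:                               # normal character
            vr = ('sym', c)
            v = vr if v is None else ('concat', v, vr)
            last += 1
    return v, last

tree, _ = parse("(AT|GA)((AG|AAA)*)$")      # the running example of the lecture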

Thompson's NFA construction method
K. Thompson. Regular expression search algorithm. Communications of the ACM, 11(6):419-422, 1968.
Idea: while traversing the parse tree T_RE in post order, construct an NFA Th(v) that accepts the language L(RE_v) corresponding to the subtree rooted at node v. Each Th(v) is obtained by concatenating the automata for the children of v with ε-transitions.
Properties of the Thompson NFA:
- #states < 2m and #transitions < 4m, i.e., O(m) in total
- It contains many ε-transitions
- Transitions other than ε-transitions always go from state i to state i + 1
Ex) Thompson NFA for RE = (AT|GA)((AG|AAA)*). (Figure: the resulting NFA with 18 states, numbered 0 to 17.)

NFA construction algorithm
For parse tree T_RE, traversing it in post order, construct an NFA Th(v) for each node v as follows:
(i) when v is ε
(ii) when v is a symbol a ∈ Σ
(iii) when v is the union operator (v_L | v_R)
(iv) when v is the concatenation operator (v_L · v_R)
(v) when v is the star operator v_C*
(Figure: the five component automata, each with an initial state I and a final state F; unions and stars are wired together with ε-transitions, and concatenation chains Th(v_L) and Th(v_R).)

Move of the NFA construction algorithm
Ex) Parse tree T_RE for RE = (AT|GA)((AG|AAA)*), and the Thompson NFA for it. (Figure: the parse tree annotated with the intermediate automata built at each node, and the resulting 18-state Thompson NFA, states 0 to 17.)

Pseudo code
Thompson_recur(v)
  if v = [|](v_L, v_R) or v = [·](v_L, v_R) then
    Th(v_L) ← Thompson_recur(v_L)
    Th(v_R) ← Thompson_recur(v_R)
  else if v = [*](v_C) then Th(v_C) ← Thompson_recur(v_C)
  /* recursive post-order traversal so far */
  if v = (ε) then return construction (i)
  if v = (α), α ∈ Σ then return construction (ii)
  if v = [|](v_L, v_R) then return construction (iii)
  if v = [·](v_L, v_R) then return construction (iv)
  if v = [*](v_C) then return construction (v)

Thompson(RE)
  v_RE ← Parse(RE$, 1)                /* construct the parse tree */
  Th(v_RE) ← Thompson_recur(v_RE)
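
Continuing the Python sketches (my own code, not the lecture's reference implementation), the five constructions can be realized over the tuple parse tree as follows; concatenation is chained with an ε-transition here, which is one common variant of the Thompson construction.

def thompson(tree):
    trans, eps = {}, {}              # (state, symbol) -> state; state -> set of ε-successors
    counter = [0]
    def new_state():
        s = counter[0]; counter[0] += 1
        eps.setdefault(s, set())
        return s
    def build(node):
        kind = node[0]
        if kind == 'sym':            # construction (ii): one labeled transition
            i, f = new_state(), new_state()
            trans[(i, node[1])] = f
            return i, f
        if kind == 'concat':         # construction (iv): chain the two sub-automata
            i1, f1 = build(node[1]); i2, f2 = build(node[2])
            eps[f1].add(i2)
            return i1, f2
        if kind == 'union':          # construction (iii): new initial/final states, ε-branches
            i1, f1 = build(node[1]); i2, f2 = build(node[2])
            i, f = new_state(), new_state()
            eps[i] |= {i1, i2}; eps[f1].add(f); eps[f2].add(f)
            return i, f
        if kind == 'star':           # construction (v): ε-loop around the sub-automaton
            i1, f1 = build(node[1])
            i, f = new_state(), new_state()
            eps[i] |= {i1, f}; eps[f1] |= {i1, f}
            return i, f
        raise ValueError(kind)
    initial, final = build(tree)
    return initial, final, trans, eps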

Glushkov's NFA construction method
V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16:1-53, 1961.
Idea: make a new expression RE' by numbering each symbol a ∈ Σ of RE in order from left to right (let Σ' be the alphabet with subscripts).
Ex) RE = (AT|GA)((AG|AAA)*), RE' = (A1 T2 | G3 A4)((A5 G6 | A7 A8 A9)*)
Create an NFA that accepts L(RE'), then convert it into the final NFA by eliminating the subscripts of the symbols.
Properties of the Glushkov NFA:
- #states is exactly m + 1, but #transitions is O(m^2)
- There are no ε-transitions
- For any state v, all the labels of the transitions into v are the same
Ex) The NFA for RE' = (A1 T2 | G3 A4)((A5 G6 | A7 A8 A9)*) and the Glushkov NFA obtained from it by dropping the subscripts. (Figure: both automata, with states 0 to 9.)

NFA construction algorithm (1)
Let RE' be the numbered expression for RE, Pos(RE') = {1, ..., m}, and Σ' the alphabet with subscripts; α_x denotes the subscripted symbol at position x.
Traversing the parse tree T_RE' in post order, for each language RE'_v corresponding to the subtree with v as the top node, calculate the sets First(RE'_v) and Last(RE'_v) and the functions Empty(v) and Follow(RE'_v, x), defined as follows:
First(RE') = {x ∈ Pos(RE') | there is u ∈ Σ'* such that α_x u ∈ L(RE')}   (gives the transitions from the initial state of the NFA)
Last(RE') = {x ∈ Pos(RE') | there is u ∈ Σ'* such that u α_x ∈ L(RE')}   (gives the final states of the NFA)
Follow(RE', x) = {y ∈ Pos(RE') | there are u, v ∈ Σ'* such that u α_x α_y v ∈ L(RE')}   (gives the transition function)
Empty(RE') returns {ε} if ε ∈ L(RE'), or ∅ otherwise   (is the initial state of the NFA also a final state?)
Empty can be calculated recursively: Empty(ε) = {ε}; Empty(a) = ∅ for a ∈ Σ; Empty(RE'_1 | RE'_2) = Empty(RE'_1) ∪ Empty(RE'_2); Empty(RE'_1 · RE'_2) = Empty(RE'_1) ∩ Empty(RE'_2); Empty(RE'*) = {ε}.
The NFA is constructed from the values obtained above.

NFA construction algorithm (2)
The Glushkov NFA GL = (S, Σ', I, F, δ) that accepts the language L(RE'):
- S: the set of states, S = {0, 1, ..., m}
- Σ': the alphabet with subscripts
- I: the initial state, i.e., I = 0
- F: the set of final states, F = Last(RE') ∪ (Empty(RE') · {0}), i.e., state 0 is also final iff ε ∈ L(RE')
- δ: the transition function, defined as follows: for every x ∈ Pos(RE') and y ∈ Follow(RE', x), y ∈ δ(x, α_y). The transitions from the initial state are: for every y ∈ First(RE'), y ∈ δ(0, α_y).
Ex) The NFA for RE' = (A1 T2 | G3 A4)((A5 G6 | A7 A8 A9)*). (Figure: the automaton with states 0 to 9.)

Pseudo code
Glushkov_variables(v, lpos)
  if v = [|](v_l, v_r) or v = [·](v_l, v_r) then
    lpos ← Glushkov_variables(v_l, lpos)
    lpos ← Glushkov_variables(v_r, lpos)
  else if v = [*](v_*) then lpos ← Glushkov_variables(v_*, lpos)
  end of if
  if v = (ε) then
    First(v) ← ∅, Last(v) ← ∅, Empty(v) ← {ε}
  else if v = (a), a ∈ Σ then
    lpos ← lpos + 1
    First(v) ← {lpos}, Last(v) ← {lpos}, Empty(v) ← ∅, Follow(lpos) ← ∅
  else if v = [|](v_l, v_r) then
    First(v) ← First(v_l) ∪ First(v_r)
    Last(v) ← Last(v_l) ∪ Last(v_r)
    Empty(v) ← Empty(v_l) ∪ Empty(v_r)
  else if v = [·](v_l, v_r) then
    First(v) ← First(v_l) ∪ (Empty(v_l) · First(v_r))
    Last(v) ← (Empty(v_r) · Last(v_l)) ∪ Last(v_r)
    Empty(v) ← Empty(v_l) ∩ Empty(v_r)
    for x ∈ Last(v_l) do Follow(x) ← Follow(x) ∪ First(v_r)
  else if v = [*](v_*) then
    First(v) ← First(v_*), Last(v) ← Last(v_*), Empty(v) ← {ε}
    for x ∈ Last(v_*) do Follow(x) ← Follow(x) ∪ First(v_*)
  end of if
  return lpos
(The Follow-updating loops dominate the cost: implemented naively the whole computation takes O(m^3) time in total, but it can be done in O(m^2) time.)

Pseudo code (cont.)
Glushkov(RE)
  /* make a parse tree by parsing RE */
  v_RE ← Parse(RE$, 1)
  /* calculate each variable by using the parse tree */
  m ← Glushkov_variables(v_RE, 0)
  /* construct the NFA GL = (S, Σ', I, F, δ) from the variables */
  Δ ← ∅
  for i ∈ 0 ... m do create state i
  for x ∈ First(v_RE) do Δ ← Δ ∪ {(0, α_x, x)}
  for i ∈ 1 ... m do
    for x ∈ Follow(i) do Δ ← Δ ∪ {(i, α_x, x)}
  end of for
  for x ∈ Last(v_RE) ∪ (Empty(v_RE) · {0}) do mark x as a final state
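
For completeness, here is a Python sketch of my own of the Glushkov construction over the same tuple parse tree; it computes First, Last, Empty and Follow in one post-order pass and then assembles the transition table.

def glushkov(tree):
    pos_sym = {}                        # position -> original (unsubscripted) symbol
    follow = {}                         # position -> set of positions that may follow it

    def visit(node):                    # returns (First, Last, Empty) of the subtree
        kind = node[0]
        if kind == 'sym':
            p = len(pos_sym) + 1        # number symbols 1..m from left to right
            pos_sym[p] = node[1]
            follow[p] = set()
            return {p}, {p}, False
        if kind == 'union':
            f1, l1, e1 = visit(node[1]); f2, l2, e2 = visit(node[2])
            return f1 | f2, l1 | l2, e1 or e2
        if kind == 'concat':
            f1, l1, e1 = visit(node[1]); f2, l2, e2 = visit(node[2])
            for x in l1:
                follow[x] |= f2
            return f1 | (f2 if e1 else set()), (l1 if e2 else set()) | l2, e1 and e2
        if kind == 'star':
            f1, l1, _ = visit(node[1])
            for x in l1:
                follow[x] |= f1
            return f1, l1, True
        raise ValueError(kind)

    first, last, empty = visit(tree)
    delta = {}                          # (state, symbol) -> set of target states; 0 is initial
    for y in first:
        delta.setdefault((0, pos_sym[y]), set()).add(y)
    for x, ys in follow.items():
        for y in ys:
            delta.setdefault((x, pos_sym[y]), set()).add(y)
    finals = set(last) | ({0} if empty else set())
    return len(pos_sym), delta, finals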

Flow of the matching process (reprint)
General flow: RE → (parsing) → parse tree → (NFA construction by the Thompson method or the Glushkov method) → NFA → (text scan) → report the occurrences.
- The NFA is simulated in O(mn) time.
- O(2^m) time and space is needed for translating the NFA into a DFA.
- There also exists a method of converting an RE directly into a DFA; refer to Sec. 3.9 of Compilers: Principles, Techniques, and Tools by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986 (Japanese translation: コンパイラ 原理・技法・ツール).

Methods of simulating an NFA
Simulating a Thompson NFA directly:
- The most naïve method: store the currently active states in a list of size O(m) and update it in O(m) time per text character; it obviously takes O(mn) time.
Simulating a Thompson NFA by converting it into an equivalent DFA:
- Based on the classical conversion (subset construction) technique; it takes O(2^m) time and space for preprocessing.
- There is a method that dynamically converts only the necessary parts of the DFA during the text scan: A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
Efficient hybrid technique:
- Divide the Thompson NFA into modules consisting of O(k) nodes each and convert each module; the transitions between modules are simulated in an NFA manner: E. W. Myers. A four Russians algorithm for regular expression pattern matching. Journal of the ACM, 39(2):430-448, 1992.
High-speed NFA simulation by bit-parallel techniques:
- Simulating a Thompson NFA: S. Wu and U. Manber [1992]
- Simulating a Glushkov NFA: G. Navarro and M. Raffinot [1999]
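
The naive direct simulation is easy to make concrete. The sketch below (my own code) runs the Thompson NFA from the earlier sketches over a text, keeping the set of active states and taking their ε-closure after each character; the initial state is kept active so that occurrences may start at any text position.

def eps_closure(states, eps):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_search(initial, final, trans, eps, text):
    occurrences = []
    active = eps_closure({initial}, eps)
    for pos, c in enumerate(text, start=1):
        nxt = {trans[(s, c)] for s in active if (s, c) in trans}
        active = eps_closure(nxt | {initial}, eps)
        if final in active:
            occurrences.append(pos)     # an occurrence ends at this position
    return occurrences

# Example, combining the earlier sketches:
# tree, _ = parse("(AT|GA)((AG|AAA)*)$")
# i, f, trans, eps = thompson(tree)
# print(nfa_search(i, f, trans, eps, "CCATAGAAAGG"))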

Bit-parallel Thompson
S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.
Simulating a Thompson NFA by the bit-parallel technique:
- For a Thompson NFA, the target of every non-ε transition from the i-th state is always the (i+1)-th state, so a bit-parallel simulation similar to the Shift-And method is applicable.
- The ε-transitions are simulated separately: a mask table of size 2^L is needed (L is the number of states of the NFA).
- It takes O(2^L + m|Σ|) time for preprocessing; it scans the text in O(n) time when L is small enough.
Mask tables for the Thompson NFA (Q = {s_0, ..., s_{|Q|-1}}, Σ, I = {s_0}, F, Δ):
Q_n = {0, ..., |Q|-1}, I_n = 0^{|Q|-1} 1,
F_n = OR over s_j ∈ F of 0^{|Q|-1-j} 1 0^j,
B_n[i, σ] = OR over (s_i, σ, s_j) ∈ Δ of 0^{|Q|-1-j} 1 0^j,
E_n[i] = OR over s_j ∈ E(i) of 0^{|Q|-1-j} 1 0^j   (where E(i) is the ε-closure of s_i),
E_d[D] = OR over all i with D & 0^{L-i-1} 1 0^i ≠ 0^L of E_n[i]   (the ε-closure of a state set D given as a bit mask),
B[σ] = OR over i ∈ 0 ... L-1 of B_n[i, σ].

Pseudo code
BuildEps(N = (Q_n, Σ, I_n, F_n, B_n, E_n))
  for σ ∈ Σ do
    B[σ] ← 0^L
    for i ∈ 0 ... L-1 do B[σ] ← B[σ] | B_n[i, σ]
  end of for
  E_d[0] ← E_n[0]
  for i ∈ 0 ... L-1 do
    for j ∈ 0 ... 2^i - 1 do
      E_d[2^i + j] ← E_n[i] | E_d[j]
    end of for
  end of for
  return (B, E_d)

BPThompson(N = (Q_n, Σ, I_n, F_n, B_n, E_n), T = t_1 t_2 ... t_n)
  Preprocessing:
    (B, E_d) ← BuildEps(N)
  Searching:
    D ← E_d[I_n]                      /* initial state */
    for pos ∈ 1 ... n do
      if D & F_n ≠ 0^L then report an occurrence ending at pos - 1
      D ← E_d[(D << 1) & B[t_pos]]
    end of for
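
The core of the method is the Shift-And style state update D ← (D << 1) & B[t_pos] (plus the ε-closure table). To illustrate just that bit-parallel update, here is a self-contained sketch of my own of plain Shift-And matching, which is the special case of a string pattern, i.e., an RE without operators and hence without ε-transitions.

def shift_and(pattern, text):
    m = len(pattern)
    B = {}                              # B[c]: bitmask of pattern positions holding c
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    D = 0                               # bit i set <=> the NFA state after pattern[i] is active
    occurrences = []
    for pos, c in enumerate(text, start=1):
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & (1 << (m - 1)):
            occurrences.append(pos)     # an occurrence ends at this position
    return occurrences

print(shift_and("AGA", "CCATAGAAAGG"))  # [7]: "AGA" ends at text position 7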

Summary
- REs and finite automata have the same ability to define languages.
- Flow of regular expression matching: construct an NFA via a parse tree for the RE, then simulate the NFA while scanning the text. With filtering: filtering + multiple pattern matching of factors + verification by NFA simulation.
- How to construct an NFA:
  Thompson NFA: #states < 2m and #transitions < 4m, i.e., O(m) space; contains many ε-transitions; non-ε transitions always go from state i to state i + 1.
  Glushkov NFA: #states is exactly m + 1, but #transitions is O(m^2); there are no ε-transitions; for any state v, all the labels of the transitions into v are the same.
- How to simulate an NFA:
  Simulating a Thompson NFA directly takes O(mn) time.
  Converting it into a DFA allows scanning in O(n) time, but takes O(2^m) time and space for preprocessing.
  Speeding up by bit-parallel techniques: bit-parallel Thompson, bit-parallel Glushkov.
The next theme: Compressed Pattern Matching

Appendix
About the definitions of terms which I didn't explain in the first lecture:
- A subset of Σ* is called a formal language, or a language for short.
- For languages L_1, L_2 ⊆ Σ*, the set {xy | x ∈ L_1 and y ∈ L_2} is called the product of L_1 and L_2 and is denoted by L_1 · L_2, or simply L_1 L_2.
- For a language L ⊆ Σ*, we define L^0 = {ε} and L^n = L^{n-1} · L (n ≥ 1). Moreover, we define L* = ∪_{n≥0} L^n and call it the closure of L. We also denote L+ = ∪_{n≥1} L^n.
About look-behind notations:
- Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, The MIT Press / Elsevier, 1990. (Japanese translation: コンピュータ基礎理論ハンドブック Ⅰ: アルゴリズムと複雑さ, 丸善, 1994.) See Chapter 5, Sec. 2.3 and Sec. 6.1.
- According to this, the notion seems to have appeared in 1964.
- It exceeds the power of context-free grammars (and of course goes beyond REs)!
- Its matching problem is proved to be NP-complete!