Regular Expression Matching with Multi-Strings and Intervals. Philip Bille, Mikkel Thorup



Outline. Definition. Applications. Previous work. Two new problems: multi-strings and character class intervals. Algorithms: Thompson's algorithm with multi-strings; decomposition-based algorithms with multi-strings; character class interval extensions.

Regular Expressions. A character α is a regular expression. If S and T are regular expressions, then so are: the union S | T, the concatenation ST (also written S · T), and the Kleene star S*.

Languages. The language L(R) of a regular expression R is: L(α) = {α}; L(S | T) = L(S) ∪ L(T); L(ST) = L(S)L(T); L(S*) = {ε} ∪ L(S) ∪ L(S)² ∪ L(S)³ ∪ ···

Example. R = (a*)(b | c). L(R) = {b, c, ab, ac, aab, aac, ...}
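The recursive definition of L(R) can be evaluated directly on finite sets of strings. The following is a minimal sketch, assuming a length bound (the hypothetical parameter max_len) so that the Kleene star stays finite; the function names are illustrative and not from the slides.

```python
def union(s, t):
    """L(S | T) = L(S) ∪ L(T)."""
    return s | t


def concat(s, t):
    """L(ST) = every string of L(S) followed by every string of L(T)."""
    return {x + y for x in s for y in t}


def star(s, max_len):
    """L(S*) truncated to strings of length at most max_len (assumed bound)."""
    result = {""}
    frontier = {""}
    while frontier:
        # Append one more string from S to everything found in the last round.
        frontier = {x + y for x in frontier for y in s
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result


# R = (a*)(b | c), with a* truncated to at most two repetitions.
language = concat(star({"a"}, 2), union({"b"}, {"c"}))
print(sorted(language, key=lambda word: (len(word), word)))
# prints ['b', 'c', 'ab', 'ac', 'aab', 'aac']
```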

Regular Expression Matching. Given a regular expression R and a string Q, the regular expression matching problem is to decide whether Q ∈ L(R).
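As a point of reference, the decision problem Q ∈ L(R) corresponds to a whole-string match in a standard regex library. The sketch below uses Python's re.fullmatch; note that Python's engine is a backtracking matcher, not one of the worst-case efficient algorithms these slides are about, and the helper name matches is made up for illustration.

```python
import re


def matches(pattern, text):
    """Decide whether the whole of `text` belongs to the language of `pattern`."""
    return re.fullmatch(pattern, text) is not None


print(matches("(a*)(b|c)", "aab"))   # True: "aa" followed by "b"
print(matches("(a*)(b|c)", "aba"))   # False: nothing may follow the final b or c
```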

Applications. Primitive in large-scale data processing: Internet traffic analysis, protein searching, XML queries. Standard utilities and tools: grep and sed, Perl.

Previous Work (Worst-Case Efficient Algorithms). Let |R| = m and |Q| = n, and let w be the number of bits in a machine word. The standard textbook algorithm [Thompson 1968] simulates a non-deterministic automaton (NFA) in O(nm) time. NFA-decomposition algorithms [Myers 1992], [B 2006], [B, Farach-Colton 2005], [B, T 2009]: decompose the NFA into a tree of small NFAs and combine with tabulation and/or word-level parallelism to speed up Thompson's algorithm. We will need the O(n (m log w / w + log m)) time algorithm [B 2006] for our results; it is the fastest known algorithm for large w.

Problem 1: Multi-Strings. Many regular expressions consist of k << m strings. Example: Gnutella download stream detection: (Server:|User-Agent:)( |\t)*(limewire|BearShare|Gnucleus|Morpheus|XoloX|gtk-gnutella|Mutella|MyNapster|Qtella|AquaLime|NapShare|Comback|PHEX|SwapNut|FreeWire|Openext|Toadnode). k = 21 vs. m = 174. Can we exploit k << m in algorithms for regular expression matching?
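To make the count k = 21 concrete, the example expression can be assembled from its three groups of strings. This is only an illustrative sketch: the list and function names are made up here, the client names are spelled as reconstructed above, and the code does not try to reproduce m = 174, since that depends on the length convention used for R.

```python
import re

# The three groups of strings that make up the example expression.
headers = ["Server:", "User-Agent:"]
blanks = [" ", "\t"]
clients = ["limewire", "BearShare", "Gnucleus", "Morpheus", "XoloX",
           "gtk-gnutella", "Mutella", "MyNapster", "Qtella", "AquaLime",
           "NapShare", "Comback", "PHEX", "SwapNut", "FreeWire",
           "Openext", "Toadnode"]


def alternation(strings):
    """Build (s1|s2|...) with every string escaped for Python's re syntax."""
    return "(" + "|".join(re.escape(s) for s in strings) + ")"


pattern = alternation(headers) + alternation(blanks) + "*" + alternation(clients)
k = len(headers) + len(blanks) + len(clients)

print(k)  # 21 strings in total
print(re.fullmatch(pattern, "User-Agent: \tgtk-gnutella") is not None)  # True
```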

Problem 2: Character Class Intervals. For a subset of characters C, the character class interval C{x,y} represents a string of characters from C of length at least x and at most y. Example: [fg]{13,42}. The special case of gaps (Σ{x,y}) is important in protein searching. We can always convert a character class interval operator to standard operators, but this increases the length of the regular expression by y. Can we efficiently implement character class interval operators in regular expression matching?
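The conversion to standard operators can be sketched as follows. This is one possible rewriting, not necessarily the one the authors have in mind: it uses x mandatory copies of the class followed by y - x nested optional copies, where the optional operator `?` is assumed as shorthand for a union with the empty string. The function name expand_interval is hypothetical.

```python
import re


def expand_interval(char_class, x, y):
    """Rewrite char_class{x,y} without the interval operator.

    The result contains exactly y copies of char_class, which is why the
    expression grows with y.
    """
    required = char_class * x
    optional = ""
    for _ in range(y - x):
        # Wrap what we have so far: one more optional character in front.
        optional = "(?:" + char_class + optional + ")?"
    return required + optional


print(expand_interval("[fg]", 2, 4))                 # [fg][fg](?:[fg](?:[fg])?)?
print(expand_interval("[fg]", 13, 42).count("[fg]"))  # 42
print(re.fullmatch(expand_interval("[fg]", 2, 4), "fgf") is not None)  # True
```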


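Thompson's Algorithm. Recursively construct a non-deterministic finite automaton (NFA) from R. [Figure: construction rules for (a) a character α, (b) the concatenation of N(S) and N(T), (c) the union of N(S) and N(T), joined by ε-transitions, (d) the Kleene star of N(S), with ε-transitions.]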

Thompson's Algorithm. [Figure: TNFA N(R) for an example regular expression R.] The Thompson NFA (TNFA) N(R) has O(|R|) = O(m) states and transitions. N(R) accepts L(R): any path from the start state to the accept state corresponds to a string in L(R) and vice versa. Traverse the TNFA on Q one character at a time. O(m) per character => O(|Q| m) = O(nm) time algorithm. Can we get O(nk)?
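The construction and the state-set traversal can be sketched as below. This is an illustrative variant rather than the exact construction in the figure: the class and method names are made up, concatenation is linked with an ε-transition instead of merging states, and the expression is built from combinator calls instead of being parsed. Each character is processed by one pass over the active states plus an ε-closure, which is where the O(m) cost per character comes from.

```python
class TNFA:
    """A small Thompson-style NFA built from combinators (illustrative only)."""

    def __init__(self):
        self.eps = []    # eps[s]: states reachable from s by one ε-transition
        self.char = []   # char[s]: (symbol, target) or None

    def new_state(self):
        self.eps.append([])
        self.char.append(None)
        return len(self.eps) - 1

    def symbol(self, a):
        s, t = self.new_state(), self.new_state()
        self.char[s] = (a, t)
        return s, t

    def concat(self, f, g):
        # Variant: link the two fragments by an ε-transition.
        self.eps[f[1]].append(g[0])
        return f[0], g[1]

    def union(self, f, g):
        s, t = self.new_state(), self.new_state()
        self.eps[s] += [f[0], g[0]]
        self.eps[f[1]].append(t)
        self.eps[g[1]].append(t)
        return s, t

    def star(self, f):
        s, t = self.new_state(), self.new_state()
        self.eps[s] += [f[0], t]
        self.eps[f[1]] += [f[0], t]
        return s, t

    def closure(self, states):
        """All states reachable from `states` through ε-transitions."""
        seen, stack = set(states), list(states)
        while stack:
            for v in self.eps[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    def accepts(self, fragment, text):
        start, accept = fragment
        active = self.closure({start})
        for ch in text:  # one character at a time
            moved = {self.char[s][1] for s in active
                     if self.char[s] is not None and self.char[s][0] == ch}
            active = self.closure(moved)
        return accept in active


nfa = TNFA()
# (a*)(b|c), built bottom-up
r = nfa.concat(nfa.star(nfa.symbol("a")),
               nfa.union(nfa.symbol("b"), nfa.symbol("c")))
print(nfa.accepts(r, "aab"), nfa.accepts(r, "aba"))  # True False
```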

Thompson's Algorithm with Multi-Strings. [Figure: example automaton.] Construct a pruned TNFA: replace the strings L = {L1, ..., Lk} with single transitions => the number of states and transitions is O(k). Maintain a FIFO bit queue for each Li, of length |Li|. Preprocess L for fast multi-string matching (Aho-Corasick automaton).

Thompson's Algorithm with Multi-Strings. [Figure: example automaton.] Interleave the traversal of the TNFA with multi-string matching, one character from Q at a time: if the start point of a string transition is active, enqueue 1, else enqueue 0. If the front of the queue is 1 and the string matches, make the endpoint active. O(k) states and transitions, k queues, and multi-string matching is fast => O(k) time per character => total time O(nk + m log k) and space O(m).
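A simplified sketch of the interleaved traversal follows. It keeps the pruned automaton and the per-string FIFO bit queues, but it deliberately swaps the Aho-Corasick automaton for a direct substring comparison, so it does not achieve the stated time bound. All names are hypothetical, strings are assumed non-empty, and the queue is sized so that the bit dequeued after reading character j records whether the start point was active |Li| characters earlier.

```python
from collections import deque


class PrunedTNFA:
    """TNFA whose labelled transitions are whole strings (illustrative only)."""

    def __init__(self):
        self.eps = []       # eps[s]: ε-successors of state s
        self.strings = []   # (start state, string L_i, end state)

    def new_state(self):
        self.eps.append([])
        return len(self.eps) - 1

    def string(self, word):
        s, t = self.new_state(), self.new_state()
        self.strings.append((s, word, t))
        return s, t

    def concat(self, f, g):
        self.eps[f[1]].append(g[0])
        return f[0], g[1]

    def union(self, f, g):
        s, t = self.new_state(), self.new_state()
        self.eps[s] += [f[0], g[0]]
        self.eps[f[1]].append(t)
        self.eps[g[1]].append(t)
        return s, t

    def star(self, f):
        s, t = self.new_state(), self.new_state()
        self.eps[s] += [f[0], t]
        self.eps[f[1]] += [f[0], t]
        return s, t

    def closure(self, states):
        seen, stack = set(states), list(states)
        while stack:
            for v in self.eps[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    def accepts(self, fragment, text):
        start, accept = fragment
        # One FIFO bit queue per string, pre-filled with |L_i| - 1 zeros.
        queues = [deque([0] * (len(w) - 1)) for _, w, _ in self.strings]
        active = self.closure({start})
        for j in range(1, len(text) + 1):        # j characters read so far
            reached = set()
            for (s, w, t), q in zip(self.strings, queues):
                q.append(1 if s in active else 0)
                # Stand-in for Aho-Corasick: compare the last |w| characters.
                if q.popleft() and text.startswith(w, j - len(w)):
                    reached.add(t)
            active = self.closure(reached)
        return accept in active


nfa = PrunedTNFA()
r = nfa.concat(
    nfa.concat(nfa.union(nfa.string("Server:"), nfa.string("User-Agent:")),
               nfa.star(nfa.union(nfa.string(" "), nfa.string("\t")))),
    nfa.union(nfa.string("limewire"), nfa.string("PHEX")))
print(nfa.accepts(r, "User-Agent: \tPHEX"))  # True
```

The automaton here has 20 states for 6 strings, independent of the total length of the strings.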

Decomposition Algorithms. We use the NFA-decomposition algorithm based on word-level parallelism [B 2006]. Simplifying assumption: m ≥ w. Decompose the TNFA into a tree of O(m/w) micro TNFAs, each with at most w states. Encode each micro TNFA state-set in O(w) bits. Micro TNFA traversal on a single character takes O(log w) time using word-level parallelism. => O((m/w) log w) time on a single character for the entire TNFA => O(nm log w / w) algorithm for regular expression matching. Fastest known for large w.

Decomposition Algorithms with Multi-Strings. Goal: replace m with k, that is, process a character in O(k log w / w) time. Apply the decomposition to the pruned TNFA: a tree of O(k/w) micro TNFAs with at most w states and w strings. Reuse the ε-transition traversal => O(log w) per micro TNFA. Reuse the multi-string matching algorithm. The missing piece: how can we maintain w bit queues in O(log w) time per operation?

Case 1: Short Bit Queues (length ≤ 2w). First, suppose all queues have the same length! Represent the queues vertically. In each step, insert the input bits at the back of the queues and output the fronts of the queues. Implicitly move all bits forward by updating the pointer to the start of the queues. => O(1) time per step.
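For the equal-length case, the vertical representation can be sketched as a circular array of words, where bit i of each word belongs to queue i. The class below is a hypothetical illustration: Python integers stand in for w-bit machine words, and the convention assumed here is that a bit comes out exactly `length` steps after it went in.

```python
class VerticalBitQueues:
    """w bit queues of equal length, stored as `length` words (illustrative)."""

    def __init__(self, length):
        self.words = [0] * length   # bit i of words[t] is entry t of queue i
        self.front = 0              # index of the word holding all the fronts

    def step(self, in_bits):
        """Dequeue the front bit of every queue and enqueue `in_bits`.

        Both happen with one word read, one word write and a pointer update,
        independently of the number of queues packed into a word.
        """
        out_bits = self.words[self.front]
        self.words[self.front] = in_bits   # the freed slot becomes the back
        self.front = (self.front + 1) % len(self.words)
        return out_bits


queues = VerticalBitQueues(3)
outputs = [queues.step(bits) for bits in [0b01, 0b10, 0b11, 0, 0, 0]]
print(outputs)  # [0, 0, 0, 1, 2, 3]
```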

Case 1: Different lengths?

Case 1: Short Bit Queues (length ≤ 2w). With bit masks and standard bitwise operations we can implement each jump point in O(1) time. => O(log w) time per step.

Case 2: Long Bit Queues (length > 2w). Horizontal representation with vertical front and back buffers of length w. Enqueue and dequeue from the buffers in O(1) time. Every w steps (full buffers): transpose the back buffer and insert it into the horizontal representation; transpose the front w entries of the horizontal representation and insert them into the front buffer. A transpose takes O(w log w) time [T 1997] => amortized O(log w) time per step.

Algorithm Summary. O(log w) per character per micro TNFA => O(k log w / w) per character => total time O(n (k log w / w + log k) + m log k) and space O(m).

Character Class Intervals. New technique to maintain w counters in parallel with reset and decrement operations. Combine with bit queues to support character class intervals.