Simple Real-Time Constant-Space String Matching. Dany Breslauer, Roberto Grossi and Filippo Mignosi

Similar documents
Optimal Packed String Matching

CMPSCI 250: Introduction to Computation. Lecture #28: Regular Expressions and Languages David Mix Barrington 2 April 2014

Experimental Results on String Matching Algorithms

String matching algorithms تقديم الطالب: سليمان ضاهر اشراف المدرس: علي جنيدي

Regular Languages and Regular Expressions

A NEW STRING MATCHING ALGORITHM

Efficient Data Structures for the Factor Periodicity Problem

Fast Parallel String Prex-Matching. Dany Breslauer. April 6, Abstract. n log m -processor CRCW-PRAM algorithm for the

Announcements. Programming assignment 1 posted - need to submit a.sh file

SIGNAL COMPRESSION Lecture Lempel-Ziv Coding

1.3 Functions and Equivalence Relations 1.4 Languages

Tying up the loose ends in fully LZW-compressed pattern matching

58093 String Processing Algorithms. Lectures, Autumn 2013, period II

CSCI 1820 Notes. Scribes: tl40. February 26 - March 02, Estimating size of graphs used to build the assembly.

Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

A very fast string matching algorithm for small. alphabets and long patterns. (Extended abstract)

Suffix Trees and their Applications in String Algorithms

CHAPTER TWO LANGUAGES. Dr Zalmiyah Zakaria

String Matching Algorithms

Frequency Covers for Strings

Advanced Combinatorial Optimization September 15, Lecture 2

Suffix Trees and Arrays

Algorithms and Data Structures Lesson 3

String Matching. Geetha Patil ID: Reference: Introduction to Algorithms, by Cormen, Leiserson and Rivest

S ELECTE Columbia University

Small-Space 2D Compressed Dictionary Matching

Experiments on string matching in memory structures

Applied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017

Matchings in Graphs. Definition 1 Let G = (V, E) be a graph. M E is called as a matching of G if v V we have {e M : v is incident on e E} 1.

K 4 C 5. Figure 4.5: Some well known family of graphs

A Unifying Look at the Apostolico Giancarlo String-Matching Algorithm

Problem Set 2 Solutions

Math 778S Spectral Graph Theory Handout #2: Basic graph theory

Tilings of the Sphere by Edge Congruent Pentagons

CSCI S-Q Lecture #13 String Searching 8/3/98

Prediction of Infinite Words with Automata

String Matching using Inverted Lists

Knuth-Morris-Pratt. Kranthi Kumar Mandumula Indiana State University Terre Haute IN, USA. December 16, 2011

String Matching Algorithms

Recognizing Interval Bigraphs by Forbidden Patterns

The Exact Online String Matching Problem: A Review of the Most Recent Results

Lecture 15. Error-free variable length schemes: Shannon-Fano code

LZ UTF8. LZ UTF8 is a practical text compression library and stream format designed with the following objectives and properties:

A string is a sequence of characters. In the field of computer science, we use strings more often as we use numbers.

String matching algorithms

Fast Evaluation of Union-Intersection Expressions

Turing Machine Languages

Finding Strongly Connected Components

Application of String Matching in Auto Grading System

Computational Geometry

Proof Techniques Alphabets, Strings, and Languages. Foundations of Computer Science Theory

Online Motion Planning MA-INF 1314 Online TSP, Shortcut algorithm

Computer Science Spring 2005 Final Examination, May 12, 2005

1 Variations of the Traveling Salesman Problem

3 Competitive Dynamic BSTs (January 31 and February 2)

Λέων-Χαράλαμπος Σταματάρης

MATH 409 LECTURE 10 DIJKSTRA S ALGORITHM FOR SHORTEST PATHS

1 Introduciton. 2 Tries /651: Algorithms CMU, Spring Lecturer: Danny Sleator

CSE 502 Class 23. Jeremy Buhler Steve Cole. April BFS solves unweighted shortest path problem. Every edge traversed adds one to path length

A New String Matching Algorithm Based on Logical Indexing

A TIGHT BOUND ON THE LENGTH OF ODD CYCLES IN THE INCOMPATIBILITY GRAPH OF A NON-C1P MATRIX

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

additive output sensitive cost O(occ). Indexes based upon sux trees and sux arrays and related data structures are especially ecient when several sear

arxiv: v1 [cs.ds] 15 Jan 2014

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.

Applications of Succinct Dynamic Compact Tries to Some String Problems

LCP Array Construction

CS261: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem

MCS-375: Algorithms: Analysis and Design Handout #G2 San Skulrattanakulchai Gustavus Adolphus College Oct 21, Huffman Codes

Graph Algorithms Using Depth First Search

Homework 2. Sample Solution. Due Date: Thursday, May 31, 11:59 pm

Mobius Graphs. N. Vasumathi and S. Vangipuram

Dynamic Programming CS 445. Example: Floyd Warshll Algorithm: Computing all pairs shortest paths

Solutions to Assessment

Fast Substring Matching

Algorithm Design and Analysis

Introduction to Algorithms I

Survey of Exact String Matching Algorithm for Detecting Patterns in Protein Sequence

CPS 102: Discrete Mathematics. Quiz 3 Date: Wednesday November 30, Instructor: Bruce Maggs NAME: Prob # Score. Total 60

NP-completeness of 4-incidence colorability of semi-cubic graphs

Given a graph, find an embedding s.t. greedy routing works

Enhanced Two Sliding Windows Algorithm For Pattern Matching (ETSW) University of Jordan, Amman Jordan

Exact String Matching. The Knuth-Morris-Pratt Algorithm

Solutions to the Second Midterm Exam

CSCE 411 Design and Analysis of Algorithms

Space Efficient Linear Time Construction of

Math 776 Graph Theory Lecture Note 1 Basic concepts

Dr. Alexander Souza. Winter term 11/12

UNIVERSITY of OSLO. Faculty of Mathematics and Natural Sciences. INF 4130/9135: Algoritmer: Design og effektivitet Date of exam: 14th December 2012

Strong edge coloring of subcubic graphs

University of Waterloo CS240R Winter 2018 Help Session Problems

Finite State Automata are Limited. Let us use (context-free) grammars!

Computational Geometry

Lecture 8: The Traveling Salesman Problem

CS2210 Data Structures and Algorithms

Ukkonen s suffix tree algorithm

Computing minimum distortion embeddings into a path for bipartite permutation graphs and threshold graphs

University of Waterloo CS240R Fall 2017 Solutions to Review Problems

Optimization I : Brute force and Greedy strategy

Edit Distance with Single-Symbol Combinations and Splits

Transcription:

Simple Real-Time Constant-Space String Matching Dany Breslauer, Roberto Grossi and Filippo Mignosi

Real-time string matching Pattern X X[1..m] Text T T[1..n]

Real-time string matching Pattern X X[1..m] Text T T[1..n] O(1) worst-case time to answer after reading the text symbol

Real-time string matching Pattern X X[1..m] Text T T[1..n] Different from real-time streaming s.m., where X and T cannot be entirely stored! O(1) worst-case time to answer after reading the text symbol

Constant-space string matching Pattern X X[1..m] Text T T[1..n] O(1) working space apart from that required by X and T O(log n) bits

We propose a simple way to combine the two features Take a simple version of the constant-space Crochemore- Perrin (CP) algorithm

We propose a simple way to combine the two features Take a simple version of the constant-space Crochemore- Perrin (CP) algorithm Make CP also real-time by running two instances simultaneously

Some related work Galil '81: real-time string matching Galil, Seiferas '83: constant space Karp, Rabin '87: randomized constant space real-time Crochemore, Perrin '91: constant space Gasieniec, Plandowski, Rytter '95: constant space Gasienec, Kolpakov '04: real-time + sublinear space (extends GPR'95) more papers [Crochemore, Rytter '91,'95] [Crochemore '92] [...] Porat, Porat '09: randomized streaming, O(log m) space, no real-time Breslauer, Galil '10: randomized real-time streaming, O(log m) space

Our result Real-time constant-space string matching O(1) words in addition to those for read-only X and T O(1) worst-case time to answer after each text symbol

Our result Real-time constant-space string matching O(1) words in addition to those for read-only X and T O(1) worst-case time to answer after each text symbol Not to be confused with Real-time streaming string matching O(log m) memory words (X and T cannot be kept) O(1) worst-case time to answer after each text symbol

We propose a simple way to combine the two features Take a simple version of the constant-space Crochemore- Perrin (CP) algorithm Make CP also real-time by running two instances simultaneously

Simple version of the Crochemore-Perrin (CP) algorithm Consider a non-empty prefix-suffix factorization X = u v The local period is the shortest z such that z is suffix of u or vice versa and z is a prefix of v or vice versa u z z v (u,v) length z of the local period

Simple version of the Crochemore-Perrin (CP) algorithm Consider a non-empty prefix-suffix factorization X = u v The local period is the shortest z such that z is suffix of u or vice versa and z is a prefix of v or vice versa u z z v (u,v) length z of the local period Example: X = abaaaba X = u v a baaaba ba ba z ab aaaba aaab aaab aba aaba a a

Simple version of the Crochemore-Perrin (CP) algorithm Consider a non-empty prefix-suffix factorization X = u v The local period is the shortest z such that z is suffix of u or vice versa and z is a prefix of v or vice versa u z z v (u,v) length z of the local period Example: X = abaaaba a baaaba ba ba X = u v ab aaaba aaab aaab z aba aaba a a

Simple version of the Crochemore-Perrin (CP) algorithm Consider a non-empty prefix-suffix factorization X = u v The local period is the shortest z such that z is suffix of u or vice versa and z is a prefix of v or vice versa u z z v (u,v) length z of the local period Example: X = abaaaba a baaaba ba ba ab aaaba aaab aaab X = u v aba aaba a a z

Simple version of the Crochemore-Perrin (CP) algorithm Consider a non-empty prefix-suffix factorization X = u v The local period is the shortest z such that z is suffix of u or vice versa and z is a prefix of v or vice versa (u,v) length of the local period Critical factorization if (u,v) = (X) [len. of the period of X]

Example: X = u v a baaaba ba ba z ab aaaba aaab aaab aba aaba a a

Example: a baaaba ba ba X = u v ab aaaba aaab aaab z aba aaba a a

Example: a baaaba ba ba X = u v ab aaaba aaab aaab z aba aaba a a abaa

Example: a baaaba ba ba ab aaaba aaab aaab X = u v aba aaba a a z

Example: a baaaba ba ba ab aaaba aaab aaab aba aaba a a Critical Factorization Theorem (Cesari and Vincent): Among (X) - 1 consecutive factorizations: at least one is a critical factorization

Example: a baaaba ba ba ab aaaba aaab aaab aba aaba a a Critical Factorization Theorem (Cesari and Vincent): Among (X) - 1 consecutive factorizations: at least one is a critical factorization There always exists a critical factorization X = u v such that u < (X)

Crochemore-Perrin (CP) Algorithm: Take such a critical factorization of the pattern X = u v

Crochemore-Perrin (CP) Algorithm: Take such a critical factorization of the pattern X = u v Forward scan: match v left-to-right with the current aligned portion of the text

Crochemore-Perrin (CP) Algorithm: Take such a critical factorization of the pattern X = u v Forward scan: match v left-to-right with the current aligned portion of the text Back fill: match u left-to-right with the current aligned portion of the text [originally right-to-left]

Crochemore-Perrin (CP) Algorithm: Take such a critical factorization of the pattern X = u v Forward scan: match v left-to-right with the current aligned portion of the text Back fill: match u left-to-right with the current aligned portion of the text [originally right-to-left]

We propose a simple way to combine the two features Take a simple version of the constant-space Crochemore- Perrin (CP) algorithm Make CP also real-time by running two instances simultaneously

Basic Real-Time Algorithm Interleave O(1) comparisons from the forward scan with O(1) comparisons from the back fill X = ab aaaba critical factorization abaaaba abaabaaabaa

Basic Real-Time Algorithm Interleave O(1) comparisons from the forward scan with O(1) comparisons from the back fill X = ab aaaba critical factorization abaaaba abaabaaabaa

Basic Real-Time Algorithm Interleave O(1) comparisons from the forward scan with O(1) comparisons from the back fill X = ab aaaba critical factorization abaaaba abaabaaabaa

Basic Real-Time Algorithm Interleave O(1) comparisons from the forward scan with O(1) comparisons from the back fill X = ab aaaba critical factorization abaaaba abaabaaabaa

Basic Real-Time Algorithm Interleave O(1) comparisons from the forward scan with O(1) comparisons from the back fill X = ab aaaba critical factorization abaaaba abaabaaabaa abaaaba abaabaaabaa (and charge the O( z +1) cost to the symbols in z in real time)

By contradiction, suppose there is a valid shift that is shorter...... recall that u < π(x), the length of the period u v u v

By contradiction, suppose there is a valid shift that is shorter...... recall that u < π(x), the length of the period u v π(x) u v

By contradiction, suppose there is a valid shift that is shorter...... recall that u < π(x), the length of the period u v π(x) u v Contradiction: a local period at u v that is shorter than π(x)!!

By contradiction, suppose there is a valid shift that is shorter...... recall that u < π(x), the length of the period u v π(x) u v Contradiction: a local period at u v that is shorter than π(x)!! It follows from the Crochemore-Perrin result [other case π(x) not displayed: periodicity rules out occurrences]

Basic Real-Time Algorithm Interleave O(1) comparisons from the forward scan with O(1) comparisons from the back fill Output an occurrence when the forward scan terminates (and interrupt the back fill if needed)

Basic Real-Time Algorithm Interleave O(1) comparisons from the forward scan with O(1) comparisons from the back fill Output an occurrence when the forward scan terminates (and interrupt the back fill if needed) Let z be the matched prefix of v, where X = u v is c.f.: if z v shift by z +1 positions and reset z = empty if z = v shift by π(x) positions and update z

Basic Real-Time Algorithm Interleave O(1) comparisons from the forward scan with O(1) comparisons from the back fill Output an occurrence when the forward scan terminates (and interrupt the back fill if needed) Let z be the matched prefix of v, where X = u v is c.f.: if z v shift by z +1 positions and reset z = empty if z = v shift by π(x) positions and update z Total cost is O(1) worst-case per symbol: the algorithm is real-time

Q: What if u > v?

Q: What if u > v? u v

Real-Time Variation of CP Consider a 3-way non-empty factorizaton X = u v w such that X = (uv) w is a critical factorization with uv w OR X = (uv) w is a critical factorization, and X' = u (vv') is a critical factorization for a prefix X' of X with u vv'

Real-Time Variation of CP Consider a 3-way non-empty factorizaton X = u v w such that X = (uv) w is a critical factorization with uv w OR X = (uv) w is a critical factorization, and X' = u (vv') is a critical factorization for a prefix X' of X with u vv'

Real-Time Variation of CP Consider a 3-way non-empty factorizaton X = u v w such that X = (uv) w is a critical factorization with uv w OR X = (uv) w is a critical factorization, and X' = u (vv') is a critical factorization for a prefix X' of X with u vv'

Real-Time Variation of CP X = (uv) w is a critical factorization, and X u v w u v v' Recall we may leave a "hole" to the left of w: this hole has to be covered by X'...

Real-Time Variation of CP X = (uv) w is a critical factorization, and X' = u (vv') is a critical factorization for a prefix X' of X with u vv' X X' u v w u v v' Note that X' is entirely matched since u vv'

Real-Time Variation of the CP Algorithm Interleave O(1) steps of two instances of the Basic Real-Time Algorithms, one looking for X and the other for X', aligned with X - X' positions apart.

Real-Time Variation of the CP Algorithm Interleave O(1) steps of two instances of the Basic Real-Time Algorithms, one looking for X and the other for X', aligned with X - X' positions apart. Total cost is O(1) worst-case per symbol: the algorithm is real-time and reports correctly all the occurrences Simple pseudocode

Pattern preprocessing GOAL: Find the desired 3-way non-empty factorizaton X = u v w and the length of the periods of X and X'

Pattern preprocessing GOAL: Find the desired 3-way non-empty factorizaton X = u v w and the length of the periods of X and X'

Some more definitions... A factorization u v is left-external if u (u,v) for non-empty u, v u v Define L(X) = { u v : X = u v is left-external } L(X) non-empty because of the Critical Factorization Theorem

Pattern preprocessing Let X = u 1 w be the first critical factorization in L(X) HINT: use CP preprocessing on the prefixes of X Lemma: u v L(X) prefix X' = u' v' s.t. (u',v') = (u,v)

Pattern preprocessing Let X = u 1 w be the first critical factorization in L(X) HINT: use CP preprocessing on the prefixes of X Lemma: u v L(X) prefix X' = u' v' s.t. (u',v') = (u,v) Compute CP critical factorization for u 1 = u v where u (u,v)

Pattern preprocessing Let X = u 1 w be the first critical factorization in L(X) HINT: use CP preprocessing on the prefixes of X Lemma: u v L(X) prefix X' = u' v' s.t. (u',v') = (u,v) Compute CP critical factorization for u 1 = u v where u (u,v) Extend u 1 by periodicity (u,vw) < vw : set X' = u (vv') where v' prefix of w

Pattern preprocessing Let X = u 1 w be the first critical factorization in L(X) HINT: use CP preprocessing on the prefixes of X Lemma: u v L(X) prefix X' = u' v' s.t. (u',v') = (u,v) Compute CP critical factorization for u 1 = u v where u (u,v) Extend u 1 by periodicity (u,vw) < vw : set X' = u (vv') where v' prefix of w It is u (u, v) (u, vv') = (u, vw) vv'