CS 840 Fall 2016 Text Search and Succinct Data Structures: Unit 4

Given a text file, or several text files, how do we search for a query string? Note the query/pattern is not of fixed length, unlike key searches. If we are not able to preprocess the text file (length m), the obvious method is O(mn), where n is the length of the query.

Knuth-Morris-Pratt: showed how to construct a simplified automaton in O(n) time so that the query is answered in O(m) time. Indeed the method essentially operates with two pointers into the text, and on each comparison one of them is advanced.
Knuth D.E., Morris (Jr) J.H., Pratt V.R., 1977, Fast pattern matching in strings, SIAM Journal on Computing 6(2):323-350.

Boyer-Moore: Boyer and Moore give a method by which the pattern is aligned with the text and then matched from right to left (i.e. back to front). If a mismatch is found early (near the end of the pattern string), the pattern may be advanced a substantial distance. It is thus possible to answer some queries in as little as O(m/n) time.
Boyer R.S., Moore J.S., 1977, A fast string searching algorithm. Communications of the ACM 20:762-772.

Rabin-Karp: this is a simple hashing scheme. The key idea is to have a hash function on strings of length n such that Hash(a_{i+1} ... a_{i+n}) is easily computed from Hash(a_i ... a_{i+n-1}). The usual approach is to take the sum of the a_j x^j mod r, for suitable values of x and of the modulus r, so that sliding the window by one position costs only a constant number of arithmetic operations. (A small sketch of this rolling hash follows at the end of this overview.)
Karp R.M., Rabin M.O., 1987, Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2):249-260.

When the text is allowed to be preprocessed: Peter Weiner (Linear Pattern Matching Algorithms, Proc. 14th IEEE SWAT (the old name of FOCS), pp. 1-11, 1973) proposed the suffix tree (he called it a position tree). McCreight (Edward M. McCreight: A Space-Economical Suffix Tree Construction Algorithm. JACM 23(2):262-272, 1976) and Ukkonen (Esko Ukkonen: On-Line Construction of Suffix Trees. Algorithmica 14(3):249-260, 1995) gave different and more space-efficient construction algorithms. See Robert Giegerich, Stefan Kurtz: From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction. Algorithmica 19(3):331-353, 1997, for a unified view of these algorithms. Gonnet et al. (the OED project) and Manber and Myers (Udi Manber, Eugene W. Myers: Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput. 22(5):935-948, 1993; also SODA 1990) demonstrated the importance of this structure, and of a cut-down form of it, for large-scale text indexing.

We deal with the text S[1..m], where without loss of generality the last character is an end-of-file symbol $.
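As promised above, here is a minimal sketch of the Rabin-Karp rolling hash in Python; the function name and the choices x = 256 and r = 10^9 + 7 are illustrative, not from the lecture.

def rabin_karp(text, pattern, x=256, r=1_000_000_007):
    """Report all positions (0-indexed) where pattern occurs in text."""
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return []
    # Hash of a string s of length n: sum of s[j] * x^(n-1-j) mod r.
    h_pat = h_txt = 0
    for j in range(n):
        h_pat = (h_pat * x + ord(pattern[j])) % r
        h_txt = (h_txt * x + ord(text[j])) % r
    x_top = pow(x, n - 1, r)          # weight of the character leaving the window
    hits = []
    for i in range(m - n + 1):
        # Equal hashes only give a candidate; verify to rule out collisions.
        if h_txt == h_pat and text[i:i + n] == pattern:
            hits.append(i)
        if i + n < m:
            # Slide the window: drop text[i], append text[i+n].
            h_txt = ((h_txt - ord(text[i]) * x_top) * x + ord(text[i + n])) % r
    return hits

print(rabin_karp("bananas$", "ana"))   # [1, 3]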

Suffix tree: a digital tree (or trie) in which root-to-leaf paths are in one-to-one correspondence with the suffixes of S (i.e. S[i..m], i = 1..m). All internal nodes have degree at least two, and each node is labeled with a non-null string of characters so that each root-to-leaf path spells out the appropriate suffix. [Sibling labels differ at the earliest possible point, i.e. they begin with different characters.]

Representational issues
- Denote an edge label by a pair [b,f] such that S[b..f] corresponds to the string (note that the choice of b and f is not necessarily unique).
- As there are no internal nodes of degree 1, there are < m internal nodes.
- The total space required (excluding S[1..m]) is O(m) words of lg m bits each.
- If the tree is made binary (as is often done), the edge information can be reduced to the number of bits skipped (i.e. bits on which all suffixes matching thus far continue to match).

Clearly we can build a suffix tree in O(m^2) time, and clearly we can reduce this by a straightforward sort of the suffixes. We will return to this issue of creating a suffix tree. (A small sketch of the naive approach and its queries appears at the end of this overview.)

Applications: Given a suffix tree of the text S and a query string P of length n, it is immediately obvious that we can determine whether the string P is a substring of S. Indeed, in O(n) time we can
- determine whether P occurs in S and, if so (by keeping the size of each subtree at its root), the number of occurrences, still in O(n) time (even if this number is >> n). In fact a reference to the appropriate subtree may be a convenient way to pass on all matches.
- note that, among other useful information, the suffix tree contains the longest repeated substring of S.

Applications abound:
- text processing
- bioinformatics
- compression

There are (at least) three linear-time algorithms for creating suffix trees [Weiner 1973], [McCreight, JACM 1976], [Ukkonen, Algorithmica 1995]. All suffer from random-access issues that make them costly for structures not fitting in main memory.
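To make the O(m^2) construction and the O(n)-time queries concrete, here is a minimal sketch using an uncompacted suffix trie (so Theta(m^2) space as well) rather than the compacted suffix tree of the notes; the class and function names are illustrative.

class Node:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}   # char -> Node
        self.count = 0       # number of suffixes passing through this node

def build_suffix_trie(s):
    """Naive O(m^2) construction of an uncompacted suffix trie of s (s ends with '$')."""
    root = Node()
    for i in range(len(s)):          # insert suffix s[i:]
        node = root
        for c in s[i:]:
            node = node.children.setdefault(c, Node())
            node.count += 1
    return root

def occurrences(root, p):
    """Number of occurrences of p in s, in O(n) time for |p| = n."""
    node = root
    for c in p:
        if c not in node.children:
            return 0
        node = node.children[c]
    return node.count

root = build_suffix_trie("bananas$")
print(occurrences(root, "ana"))   # 2
print(occurrences(root, "nas"))   # 1
print(occurrences(root, "xyz"))   # 0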

Weiner's approach (the original, so the others are time/space-reducing refinements):

Basic idea: for i = m downto 1, insert suffix S[i..m] into the tree built so far (call that tree T_{i+1}), keeping some back pointers. Where does S[i..m] branch off?
- At the longest prefix of S[i..m] that is a substring of what we have already seen, i.e. of S[i+1..m]. Call this Head(i).

So the method is linear if we can find Head(i). Unfortunately, at this stage the time per step is proportional to the length of this prefix.

Finding Head(i)
Key idea: at each node keep two vectors of alphabet size, an indicator vector I and a link vector L. I is binary; L holds null or a pointer. I_v(x) is the indicator for character x at node v; L_v(x) is a reference.

Key properties
- For any character x and any node u with path label A, I_u(x) = 1 in T_{i+1} iff there is a path from the root of T_{i+1} labeled xA. (The path xA need not end at a node; I_u(x) = 1 with L_u(x) = null handles that case.)
- For any character x and any node u with path label A, L_u(x) in T_{i+1} points to the internal node v of T_{i+1} whose path label is xA, if such a node exists; otherwise L_u(x) is null.

Key notion: start at the leaf for suffix S[i+1..m] in T_{i+1} and walk up, finding the first node v where I_v(S[i]) = 1 and the first node v' where L_{v'}(S[i]) is non-null. There are numerous details (including the degenerate cases where v or v' do not exist). See, for example, Gusfield, Algorithms on Strings, Trees and Sequences, for details.

Problem: the space for the suffix tree (even after it has been created) is O(m) words (2 or 3 pointers and 2 or 3 indices per node) in addition to the text of m characters. In the OED project, the binary suffix tree was 5 times the size of the text even though index points were kept only at roughly every 5th character. References to the text at the leaves are needed (in the OED case this space is roughly equal to the text itself). Both Gonnet et al. and Manber and Myers therefore tossed the suffix tree in favour of simply keeping the references to the index points in the sorted order of the suffixes they refer to. A search for a pattern of length n is then a binary search over these references, O(n lg m) time in the simple form (Manber and Myers reduce this to O(n + lg m)), and this includes getting all matches, which form a contiguous run of the array. (A small sketch follows.)
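A minimal sketch of this cut-down structure, indexing every suffix (the OED project indexed only word-start points) and using a naive sort for construction; the function names are illustrative.

def build_suffix_array(s):
    """Indices of all suffixes of s, sorted lexicographically (naive construction)."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_all(s, sa, p):
    """All starting positions of pattern p in s, via binary search on the suffix array."""
    n = len(p)
    # Leftmost suffix whose first n characters are >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + n] < p:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # Leftmost suffix whose first n characters are > p.
    lo, hi = left, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + n] <= p:
            lo = mid + 1
        else:
            hi = mid
    # All matches occupy the contiguous run sa[left:lo].
    return sorted(sa[left:lo])

s = "bananas$"
sa = build_suffix_array(s)
print(sa)                      # [7, 1, 3, 5, 0, 2, 4, 6]
print(find_all(s, sa, "ana"))  # [1, 3]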

This leads to the question of tree representations. How much space does it take to represent a tree (say a binary tree) on m nodes, so that we can efficiently navigate the tree?

A starting point: indexing an m-character text for searches starting at the beginning of each (English) word or special symbol.
- m bytes of text, about m/5 index points
- Implement as a (binary) suffix tree with references at the leaves to text positions.

Naive implementation: 2m/5 nodes, each requiring
- 2 pointers = 8 bytes
- 1 subtree size = 4 bytes
- skip/flag = 1 byte
i.e. 13 bytes per node, so 13 * 2m/5 = 5 1/5 m bytes for an index on m characters. We can only reduce this by realizing that the leaves hold only pointers, and so get away with about 4 3/5 m bytes. Neither is acceptable.

What is the information-theoretic minimum for representing a binary tree on m nodes? There are C(2m,m)/(m+1) binary trees on m nodes. {Proof sketch: writing #_m for the number of binary trees on m nodes, conditioning on the size i of the left subtree gives
  #_m = sum over i = 0..m-1 of #_i * #_(m-i-1),   with #_0 = #_1 = 1,
whose solution is the Catalan number C(2m,m)/(m+1).} Moreover C(2m,m)/(m+1) is roughly c * 2^(2m) / m^(3/2) for a small constant c. The number of bits required is the lg of this value, i.e.

  2m - (3/2) lg m + O(1),

and the lower-order term is negative, hence a recursive build-up has a chance. (A small numerical check of this bound appears at the end of this subsection.) Clearly trees can be represented in about 2m bits, but can we still navigate?

A first approach: the -(3/2) lg m term gives a hint. To represent a subtree of size m we use Tree(m) bits; then, for a root whose subtrees have sizes i and m-i-1, use rep(i) + Tree(i) + Tree(m-i-1) bits, where rep(i) is the number of bits needed to denote i using a prefix code (we don't know anything about i other than that it is less than m). Clearly 2 lg i + 1 bits suffice: write lg i 0's and then write i in binary (we can do better for large i). But there is a problem even with lg i bits when i is close to m: e.g. the recurrence T(m) = lg m + O(1) + T(m-2) leads to a Theta(m lg m) solution. Solution to this: let i be the size of the smaller subtree, and use 1 extra bit to say whether the smaller subtree is on the left or the right. Reasonable tuning then gives a 3m-bit representation by which we can navigate to children and find subtree sizes [one can actually do a bit better, but the real problem is small subtrees, and especially encoding i]. Now we have a mapping of the nodes to 1..m: the root is 1, then do a preorder scan of the left subtree, then of the right. (This way we can reduce the suffix tree representation to this plus the leaf pointers.)
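As a quick numerical check of the counting bound above (a throwaway computation, not from the lecture), the following compares lg of the Catalan number with 2m - (3/2) lg m:

from math import comb, log2

def catalan(m):
    """Number of binary trees on m nodes: C(2m, m) / (m + 1)."""
    return comb(2 * m, m) // (m + 1)

for m in (10, 100, 1000, 10000):
    exact = log2(catalan(m))              # information-theoretic minimum, in bits
    approx = 2 * m - 1.5 * log2(m)        # 2m - (3/2) lg m, ignoring the O(1)
    print(f"m={m:6d}  lg(#trees)={exact:10.1f}  2m-(3/2)lg m={approx:10.1f}")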

However, another approach leads to a 2m + o(m) bit representation and, ultimately, to more powerful navigation: Jacobson [FOCS 1989], Munro and Raman [FOCS 1997]. Use a heap-like ordering on the tree, that is:

i) Append external leaves at all missing child positions.
ii) Read the nodes level by level.
iii) Write a 1 for each internal node and a 0 for each external node.

If the original tree has m nodes, we append m+1 external nodes, so we have a (2m+1)-bit representation.

Key Lemma: The children of the i-th internal node (in level order) are the nodes at positions 2i and 2i+1.

Proof: Straightforward induction. Assume the claim holds for a tree and position i, and convert an external node into an internal one to go from i to i+1. There are a couple of details: up to and including internal node i there are i internal nodes and, say, j external nodes, so internal node i sits at position i + j; between internal node i and its left child there are i - j further nodes (counting the left child itself), so the left child is node (i + j) + (i - j) = 2i, and the right child is node 2i + 1.

Note: the i-th 1 corresponds to the i-th internal node in the level-order scan, and the i-th 0 to the i-th external node.

Consider the following operations on a bit string:
- rank(i): the number of 1's up to and including position i
- select(i): the position of the i-th 1
[Note: these work on any bit string, so this could also be used as a data structure over [1..n] for rank queries.]

The key issue is to answer these queries quickly with this base vector and only o(m) more space. But first, navigation (a small sketch follows below):
- left-child(i) = 2 rank(i)
- right-child(i) = 2 rank(i) + 1
- parent(i) = select(floor(i/2))
This is just like a heap.

Doing rank: for every k-th position, write down rank(ik), i = 0, ..., m/k. This takes (m lg m)/k bits {k, the big block size, will be omega(lg m), but wait}. So an answer is pinned down between two of these stored values.
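A minimal sketch of the level-order encoding and the heap-like navigation above, using a naive linear-scan rank/select (the o(m)-bit directories that make these constant-time are described next); the Node class and helper names are illustrative.

from collections import deque

class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def level_order_bits(root):
    """Level-order bit string: 1 per internal node, 0 per appended external node."""
    bits = []
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if v is None:               # appended external node
            bits.append(0)
        else:                       # internal (original) node
            bits.append(1)
            queue.append(v.left)
            queue.append(v.right)
    return bits                     # length 2m + 1 for m original nodes

def rank(bits, i):
    """Number of 1's in positions 1..i (naive O(i) scan)."""
    return sum(bits[:i])

def select(bits, i):
    """Position (1-based) of the i-th 1 (naive scan)."""
    seen = 0
    for pos, b in enumerate(bits, 1):
        seen += b
        if seen == i:
            return pos
    raise ValueError("fewer than i ones")

# Heap-like navigation on 1-based positions of the bit string:
def left_child(bits, i):  return 2 * rank(bits, i)
def right_child(bits, i): return 2 * rank(bits, i) + 1
def parent(bits, i):      return select(bits, i // 2)

# Example: a root with a leaf as left child and a one-child node as right child (m = 4).
root = Node(Node(), Node(Node(), None))
bits = level_order_bits(root)
print(bits)                                          # [1, 1, 1, 0, 0, 1, 0, 0, 0]
print(left_child(bits, 1), right_child(bits, 1))     # 2 3
print(parent(bits, 6))                               # 3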

For every l-th position, write down the rank since the beginning of the preceding big block. This takes (m lg k)/l bits {l, the little block size, is also still to be chosen}. To get an answer we then simply look at the bits since the beginning of the last little block. Let l = (1/2) lg m and use table lookup over all possible little blocks: there are only 2^l = sqrt(m) distinct little blocks, so the table occupies o(m) bits. So with k = (lg m)^2, the big block information takes O((m lg m)/k) = O(m / lg m) bits, and the little block information takes O((m lg k)/l) = O(m lglg m / lg m) = o(m) bits.

Select is trickier. Jacobson's original idea required inspecting O(lg m) bits that are widely scattered. Munro and Raman [FOCS 1997] avoid this problem. That paper actually uses another, related approach that also gives subtree sizes. Follow-up papers have dealt with ordered trees, trees of fixed degree (e.g. quaternary trees) and dynamic versions of the problem. (A small sketch of the two-level rank directory follows.)
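A minimal sketch of the two-level rank directory just described, with small fixed block sizes standing in for k = (lg m)^2 and l = (1/2) lg m, and a direct scan of the last little block standing in for the table lookup; the class and parameter names are illustrative.

class RankDirectory:
    """Two-level rank directory over a 0/1 list (big/little block scheme above)."""
    def __init__(self, bits, big=16, little=4):
        assert big % little == 0
        self.bits, self.big, self.little = bits, big, little
        m = len(bits)
        # big_ranks[b]    = number of 1's before position b*big
        # little_ranks[s] = number of 1's from the start of the enclosing big block
        #                   up to (but not including) position s*little
        self.big_ranks = [0] * (m // big + 1)
        self.little_ranks = [0] * (m // little + 1)
        total = since_big = 0
        for pos in range(m + 1):
            if pos % big == 0:
                self.big_ranks[pos // big] = total
                since_big = 0
            if pos % little == 0:
                self.little_ranks[pos // little] = since_big
            if pos < m:
                total += bits[pos]
                since_big += bits[pos]

    def rank(self, i):
        """Number of 1's among positions 1..i (1-based), as in the text."""
        s = i // self.little                      # little blocks wholly before position i
        start = s * self.little
        r = self.big_ranks[start // self.big] + self.little_ranks[s]
        return r + sum(self.bits[start:i])        # stands in for the table lookup

bits = [1, 1, 1, 0, 0, 1, 0, 0, 0]                # the example tree from the previous sketch
rd = RankDirectory(bits, big=8, little=4)
print(rd.rank(6))                                  # 4
print(2 * rd.rank(3), 2 * rd.rank(3) + 1)          # children of the node at position 3: 6 7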