An Improved Query Time for Succinct Dynamic Dictionary Matching

Similar documents
Succinct dictionary matching with no slowdown

Compressed Indexes for Dynamic Text Collections

Applications of Succinct Dynamic Compact Tries to Some String Problems

Lempel-Ziv Compressed Full-Text Self-Indexes

Lecture 5: Suffix Trees

Exact String Matching. The Knuth-Morris-Pratt Algorithm

Space-efficient representations of non-linear objects

A Linear Size Index for Approximate String Matching

Succinct Representation of Separable Graphs

Suffix links are stored for compact trie nodes only, but we can define and compute them for any position represented by a pair (u, d):

4. Suffix Trees and Arrays

4. Suffix Trees and Arrays

Applications of Suffix Tree

arxiv: v3 [cs.ds] 29 Jun 2010

Efficient Data Structures for the Factor Periodicity Problem

LCP Array Construction

Linear-Space Data Structures for Range Frequency Queries on Arrays and Trees

Succinct Data Structures: Theory and Practice

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

Combinatorial Pattern Matching. CS 466 Saurabh Sinha

An introduction to suffix trees and indexing

Suffix Array Construction

Exact Matching: Hash-tables and Automata

Applied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017

Suffix Trees. Martin Farach-Colton Rutgers University & Tokutek, Inc

Space-efficient Construction of LZ-index

Space-efficient Algorithms for Document Retrieval

Multiple-Pattern Matching In LZW Compressed Files Using Aho-Corasick Algorithm ABSTRACT 1 INTRODUCTION

Practical Approaches to Reduce the Space Requirement of Lempel-Ziv-Based Compressed Text Indices

MODELING DELTA ENCODING OF COMPRESSED FILES. and. and

Lecture 7 February 26, 2010

Dynamic dictionary matching and compressed suffix trees. Proceedings Of The Annual Acm-Siam Symposium On Discrete Algorithms, 2005, p.

Small-Space 2D Compressed Dictionary Matching

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

Top-k Document Retrieval in External Memory

Data Compression Techniques

Fast Prefix Search in Little Space, with Applications

A Compressed Text Index on Secondary Memory

Small-Space 2D Compressed Dictionary Matching

Data structures for string pattern matching: Suffix trees

Compressed Dynamic Tries with Applications to LZ-Compression in Sublinear Time and Space

Burrows Wheeler Transform

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet

Document Retrieval on Repetitive Collections

SORTING. Practical applications in computing require things to be in order. To consider: Runtime. Memory Space. Stability. In-place algorithms???

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

Suffix trees and applications. String Algorithms

Theoretical Computer Science

Lecture 18 April 12, 2005

Modeling Delta Encoding of Compressed Files

Colored Range Queries and Document Retrieval 1

Path Queries in Weighted Trees

University of Waterloo CS240R Winter 2018 Help Session Problems

Abdullah-Al Mamun. CSE 5095 Yufeng Wu Spring 2013

University of Waterloo CS240R Fall 2017 Review Problems

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Optimal Succinctness for Range Minimum Queries

CSCI 104 Log Structured Merge Trees. Mark Redekopp

arxiv: v7 [cs.ds] 1 Apr 2015

Cache-Oblivious Index for Approximate String Matching

Given a text file, or several text files, how do we search for a query string?

Giri Narasimhan. COT 6936: Topics in Algorithms. The String Matching Problem. Approximate String Matching

University of Waterloo CS240 Spring 2018 Help Session Problems

CSCI 104 Tries. Mark Redekopp David Kempe

A Linear-Time Burrows-Wheeler Transform Using Induced Sorting

A Lempel-Ziv Text Index on Secondary Storage

Modeling Delta Encoding of Compressed Files

A fast compact prefix encoding for pattern matching in limited resources devices

Approximate string matching using compressed suffix arrays

Indexing and Searching

Lecture L16 April 19, 2012

Linear Work Suffix Array Construction

String matching algorithms تقديم الطالب: سليمان ضاهر اشراف المدرس: علي جنيدي

Compressed Full-Text Indexes

String Matching Algorithms

Space-Efficient Framework for Top-k String Retrieval Problems

AtCoder World Tour Finals 2019

Improved and Extended Locating Functionality on Compressed Suffix Arrays

Indexing and Searching

Combined Data Structure for Previous- and Next-Smaller-Values

Submatrix Maximum Queries in Monge Matrices are Equivalent to Predecessor Search. Pawel Gawrychowski, Shay Mozes, Oren Weimann. Max?

Two Dimensional Dictionary Matching

Space Efficient Data Structures for Dynamic Orthogonal Range Counting

Inexact Matching, Alignment. See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming)

Compressed Data Structures with Relevance

Embeddings of Metrics on Strings and Permutations

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Colored Range Queries and Document Retrieval

Suffix Trees and Arrays

Compressed Cache-Oblivious String B-tree

Optimal Parallel Randomized Renaming

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

Data Structures and Methods. Johan Bollen Old Dominion University Department of Computer Science

Space Efficient Linear Time Construction of

Indexing and Searching

Linked Dynamic Tries with Applications to LZ-Compression in Sublinear Time and Space

Encoding Data Structures

CSED233: Data Structures (2017F) Lecture12: Strings and Dynamic Programming

String quicksort solves this problem by processing the obtained information immediately after each symbol comparison.

Permuted Longest-Common-Prefix Array

Transcription:

An Improved Query Time for Succinct Dynamic Dictionary Matching Guy Feigenblat 12 Ely Porat 1 Ariel Shiftan 1 1 Bar-Ilan University 2 IBM Research Haifa labs

Agenda An Improved Query Time for Succinct Dynamic Dictionary Matching- CPM 2014 Dictionary Matching Aho-Corasick Motivation Succintness Dynamic Dictionary Matching Dynamic Dictionary Matching Algorithm Succinct Aho-Corasick automaton construction using a compact working space 2

Dictionary Matching is the alphabet, = σ Input: Output: Text T = t 1 t 2 t m Dictionary of patterns D = {P 1, P 2,, P d }, D =n All characters in patterns and text belong to All occurrences of patterns in T

Dictionary Matching Solution Run KMP algorithm d times Very inefficient run time of O( T d) (assuming T > D ) Alternative solution- Generalized version of KMP Aho-Corasick Space O( D ) words Query time O( T + occ)

Aho-Corasick (1975) Goto function a trie of patterns (with n states) Failure function for each pattern s prefix the longest suffix which is a prefix of any pattern Dictionary: he, she, his, hers h e r 0 1 2 8 s 9 Report for each state, list of all patterns which end there s i h 3 4 s 6 7 e 5 GoTo Function Report Function i 2 9 7 5 ow r(i) 1 1 1 1 0 Failure Function i 1 2 3 4 5 6 7 8 9 f(i) 0 0 0 1 2 0 3 0 3

Research Motivation Reduce the storage O(n) words = O(nlogn) bits Can be up to O(logn) more than patterns length For small Alphabet the space can be O(n log n) bits vs. O(n) input.

Succintness A succinct data structure uses an amount of space that is almost the information-theoretic lower bound. Say M the information-theoretical number bits need to store some data, then : Implicit Succinct Compact Compressed Uses M + O(1) bits Uses M + o(m) bits Uses O(M) Depends on the entropy of the data being stored H 0 i 1 n i log s ni s

Static Solution - Belazzougui 2010 Encode the automaton Use 3 kind of succinct data structures: Succinct indexable dictionary (Goto) Succinctly encoded navigable tree (Failure) Succinctly encoded integer array (Report) Space n(log σ +O(1)) +O(d log n/d) bits Optimal query time O( T + occ)

Comparison of results Bits of space Hon 08 O ( nlog( )) Hon 08 nlog( )(1 o(1)) O( d log( n)) Sadakane 00 nlog( )(2 o(1)) O( d log( n)) Belazzougui 10 n n(log( ) 3.443 o(1)) O( d log( )) d Time Hon 08 O( T log log( n) occ) Hon 08 O( T log ( n) log( d)) occ) Sadakane 00 O( T log( d) log( )) occ) Belazzougui 10 O( T occ)

Dynamic Dictionary Matching is the alphabet, = σ Input: Text T = t 1 t 2 t m Dictionary of patterns D = {P 1, P 2,, P d }, D =n All characters in patterns and text belong to. Output: Patterns can be inserted or removed All occurrences of patterns in T

Solution for the dynamic case Too hard to have one succinct DS insertion and deletion should utilize only succinct working set Idea: Conceptually divide the patterns into hierarchy of groups according to their length. On insertion pattern will be added to a group according to its length. When a group becomes full - merge it to a larger group.

Main Contribution Succinct dictionary matching structure with only O(loglogn) 2 slow down in query time (for poly-logarithmic alphabet) Insertions and deletions in O(1/ε logn) slow down Succinct Aho-Corasick automaton construction with only a compact working space

Comparison of results (dynamic case) where n is the total length of d patterns, σ is the alphabet size and 0<ε<1. The space is given in bits [8] Chan, Hon, Lam, Sadakane: Dynamic dictionary matching and compressed suffix trees. SODA 2005 [7] Hon, Lam, Shah, Tam, Vitter : Succinct index for dynamic dictionary matching. ISAAC 2009

Dynamic Hierarchy The hierarchy is divided into four groups: Group XL (extra large) patterns_length Group L (Large) patterns_length < Group M (Medium) < patterns_length < Group S (Small) patterns_length

Some Observations 1) The total length of the patterns in groups Medium and Small is relatively Small. Not restricted to use only succinct data-structures 2) It is possible to utilize static structures Defer insertions and deletions. Use static DS in each block, reconstruct it every once in a while 3) Reconstruct the structure in segments Not restricted to use succinct DS relative to the size of the segment

Extra Large Group By definition, there can be at most loglogn such patterns. utilize the constant space KMP of [Rytter02] for each pattern separately Query time - O( T log log n + occ) Insertion or deletion- for pattern P, O( P )

Large Group Divide the patterns in this group into loglogn levels: Pattern in the i-th level is either: Ti is the total length of patterns in level i 1. a pattern that was originally inserted into this level. 2. a pattern that was moved into this level from lower levels Each level contains two static succinct Aho-Corasick automatons

Large Group Query: Iterate over all of the levels and search for the pattern. The total time is O( T loglogn * logσ): loglogn levels query each Aho-Corasick in O( T log σ) Delete: First locate the pattern in the sub structures; then, we remove the pattern from the Aho-Corasick automaton in O(1) time. Remove only from the report tree

Large Group Insertion: Pattern P is inserted. 1. Construct a succint Aho-Corasick automaton 2. Insert it into level j, chosen according to its size. More than two structures in level j: Merge them by building a new Aho-Corasick automaton that contains all of the patterns. Can be done in compact working space. How? Put the resulting structure into an upper level

Large Group Time analysis: the amortized insert cost is O( P log n) Space analysis: The total size is

Medium and Small Groups Small group cannot exceed o(n) Medium group can be moved to smallest level in Large group when it becomes full We are not restricted to succinct DS Use traditional dynamic algorithm Such as [Sahinalp and Vishkin 96 ], which is dynamic and optimal in time Query time - O( T + occ) Insertion or deletion- for pattern P, O( P )

Putting it All Together Problems: Groups boundaries change Patterns are not actually deleted Based on standard amortization argument, we reconstruct the DS every pre defined threshold Query time O( T loglogn logσ + occ) Insert - O(1/ε P logn) Delete O(1/ε P logn)

Remaining Obstacle Construct a succinct Aho-Corasick automaton using only compact working space A sketch is described in our paper Complete construction is left for the full paper

Summary Open Questions Can the dynamic case be further improved? Can we reduce insertion/deletion time without affecting the query time? Any questions?