Suffix-based text indices, construction algorithms, and applications.


Suffix-based text indices, construction algorithms, and applications. F. Franek Computing and Software McMaster University Hamilton, Ontario 2nd CanaDAM Conference Centre de recherches mathématiques in Montréal May 25-28, 2009 CanaDAM2009 slide 1/32

An inverted file (postings file or inverted index) is an index data structure that stores a mapping from each word to its occurrences in a given text (or set of texts). The purpose is to allow word search.

T0 = CanaDAM is a great conference
T1 = A conference really great
T2 = Best conference

vocabulary (small space)    occurrences (big space)
Canadam                     {0}
is                          {0}
a                           {0,1}
great                       {0,1}
conference                  {0,1,2}
really                      {1}
best                        {2}

Searching for "great CanaDAM": {0,1} ∩ {0} = {0} directs us to the text T0.

Not all words might be stored, and not all forms of words might be stored (lower/upper case, plurals, etc.). The occurrence entries may also contain addresses.
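The lookup above can be sketched in a few lines of Python. This is only a minimal illustration, assuming whitespace tokenisation and lower-casing as the sole normalisation; the function names `build_inverted_file` and `search` are hypothetical and not from the talk.

```python
from collections import defaultdict

def build_inverted_file(texts):
    """Map each lower-cased word to the set of text numbers containing it."""
    index = defaultdict(set)
    for number, text in enumerate(texts):
        for word in text.lower().split():
            index[word].add(number)
    return index

def search(index, query):
    """Return the numbers of the texts that contain every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

texts = ["CanaDAM is a great conference",
         "A conference really great",
         "Best conference"]
index = build_inverted_file(texts)
print(search(index, "great CanaDAM"))   # {0}
```

Each query word contributes one occurrence set, and the answer is their intersection, exactly as in {0,1} ∩ {0} = {0}.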

To reduce space, blocking may be used (fewer pointers, so the pointers can be smaller).

A stopword is a frequently occurring word that carries no meaning, e.g. "the", "a", "an".

Building an inverted file is a relatively cheap task: O(n). The vocabulary can be maintained as a hash table, trie, or B-tree; however, a simple list in lexicographic order is better for space and competitive in search (binary search, O(log n)).

Practical figures show that the space requirement and the amount of text to be traversed are sublinear (close to O(n^0.85)). No other index can do that.

Disadvantage: queries for phrases are expensive to perform. Since the text is viewed as a sequence of words, no subword search is possible (not good, for instance, for analyses of DNA or protein sequences).

Suffix-based indices: suffix trees and suffix arrays allow equal retrieval of any subword, which makes them suitable for the substring problem. The task is to identify all occurrences of a substring fast and efficiently. Instead of re-scanning the string every time we look for a pattern, we prepare a data structure that makes the search easy.

The basic idea: any substring is a prefix of a suffix.

i    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
x    a  b  c  a  a  b  c  a  b  a  c  c  a  b  a  a  c  b
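To make the basic idea concrete, the following sketch lists every suffix of x that has the pattern as a prefix. It rescans the string on purpose (no index is built); the helper name `occurrences` is an assumption made for this illustration.

```python
def occurrences(x, p):
    """1-indexed starting positions of all suffixes of x that have p as a prefix."""
    return [i + 1 for i in range(len(x)) if x[i:].startswith(p)]

x = "abcaabcabaccabaacb"
print(occurrences(x, "cab"))   # [7, 12]
```

A suffix tree or suffix array organises these suffixes ahead of time so that the same answer is found without inspecting every position.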

A common data structure for retrieval is the trie (pronounced "try"). A trie on a set X = {x1, x2, ..., xn} of pairwise distinct strings is a search tree with n+1 leaves. The edges are labelled by characters, and a path to a leaf spells a string from X. For technical reasons, each string is terminated by the lexicographically smallest sentinel symbol $. (For C/C++ aficionados, think of $ as the terminating NUL character '\0'.)

[Figure: a trie on X = { ab, abc, ba }]
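A trie with a sentinel can be sketched with nested dictionaries. This is an illustrative sketch only: the dictionary representation and the helper names `trie_insert`, `trie_contains` and `count_leaves` are assumptions, not part of the slides.

```python
def trie_insert(root, word, sentinel="$"):
    """Insert word, terminated by the sentinel, one character per edge."""
    node = root
    for ch in word + sentinel:
        node = node.setdefault(ch, {})

def trie_contains(root, word, sentinel="$"):
    """True if word (with its sentinel) spells a root-to-leaf path."""
    node = root
    for ch in word + sentinel:
        if ch not in node:
            return False
        node = node[ch]
    return True

def count_leaves(node):
    """A node without children is a leaf."""
    return 1 if not node else sum(count_leaves(child) for child in node.values())

root = {}
for word in ["ab", "abc", "ba"]:
    trie_insert(root, word)
print(trie_contains(root, "ab"), trie_contains(root, "a"))   # True False
```

The sentinel is what lets "ab" and "abc" both end in leaves even though one is a prefix of the other.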

Patricia trie (compacted trie, radix tree): all internal nodes with a single child (degree 2) are eliminated, and the edges are labelled by substrings.

[Figure: a Patricia trie on X = { ab, abc, ba }]

(Practical Algorithm To Retrieve Information Coded In Alphanumeric), Morrison (1968)

Suffix tree of a string x = Patricia trie of the set of all nontrivial suffixes of x. Weiner (1973)

i    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
x    a  b  c  a  a  b  c  a  b  a  c  c  a  b  a  a  c  b  $

[Figure: suffix tree of x]

Applications: search for substrings in a string.
- Check if a string P of length m is a substring in O(m log α), where α is the size of the alphabet.
- Find the first occurrence of the patterns P1, ..., Pq of total length m as substrings in O(m log α).
- Find all k occurrences of the patterns P1, ..., Pq of total length m as substrings in O((m + k) log α).
- Find the longest common prefix between two suffixes in Θ(log α) (requires preprocessing of the tree).
- Find all k tandem repeats in O((n log n + k) log α).

Applications: determine properties of a string.
- Find the longest common substrings of strings s1 and s2 in Θ(|s1| + |s2|).
- Find all k maximal repeats in Θ(n + k).
- Find the Lempel-Ziv decomposition in Θ(n).
- Find the longest repeated substrings in Θ(n).
- Find the most frequently occurring substrings of a minimum length in Θ(n).
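As a small illustration of one of these properties, the sketch below finds the longest repeated substrings of the running example. It is deliberately naive: it sorts the suffixes as plain strings instead of building a suffix tree, so it does not achieve the Θ(n) bound quoted above. The function name is hypothetical.

```python
def longest_repeated_substrings(x):
    """Longest substrings occurring at least twice, via adjacent sorted suffixes."""
    suffixes = sorted(x[i:] for i in range(len(x)))
    best, found = 0, set()
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1                      # length of the common prefix of a and b
        if k > best:
            best, found = k, {a[:k]}
        elif k == best and k > 0:
            found.add(a[:k])
    return found

print(sorted(longest_repeated_substrings("abcaabcabaccabaacb")))   # ['abca', 'caba']
```

In a suffix tree the same answer corresponds to the deepest internal nodes, since every internal node spells a substring shared by at least two suffixes.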

To construct a suffix tree:
- naïve iterative algorithm: O(n^2)
- smarter constructions in O(n log n), iterative:
    Weiner (1973)
    McCreight (1976): faster and less memory
    Crochemore (1981): via all maximal repetitions
    Ukkonen (1995): suffix links, online
  These are linear if the alphabet size is fixed, i.e. for small alphabets.
- complex construction in O(n) for any indexed alphabet, recursive: Farach (1997)

The basic ideas of Farach's construction

Construct the suffix tree for the odd suffixes of the input string x by recursion:

i    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
x    a  b  c  a  a  b  c  a  b  a  c  c  a  b  a  a  c  b  $
y    2     4     2     4     3     6     2     1     5

Use radix sort to sort the pairs (x[i], x[i+1]) for odd i: aa=>1, ab=>2, ba=>3, ca=>4, cb=>5, cc=>6.

Create a new string y = 2 4 2 4 3 6 2 1 5 $ and, by a recursive call, obtain its suffix tree.
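The pair-naming step can be reproduced with a short sketch. Note the assumptions: Python's built-in `sorted` stands in for the radix sort, the string is padded with "$" in case its length is odd, and `name_pairs` is a hypothetical helper name.

```python
def name_pairs(x, sentinel="$"):
    """Replace each pair (x[i], x[i+1]), i odd and 1-indexed, by its rank."""
    t = x + sentinel                                   # pads a trailing single letter
    pairs = [t[i:i + 2] for i in range(0, len(x), 2)]  # 0-indexed even = 1-indexed odd
    names = {p: r for r, p in enumerate(sorted(set(pairs)), start=1)}
    return [names[p] for p in pairs], names

y, names = name_pairs("abcaabcabaccabaacb")
print(y)       # [2, 4, 2, 4, 3, 6, 2, 1, 5]
print(names)   # {'aa': 1, 'ab': 2, 'ba': 3, 'ca': 4, 'cb': 5, 'cc': 6}
```

The string y is half as long as x, which is what makes the recursion add up to linear time overall.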

Suffix tree for y = 2 4 2 4 3 6 2 1 5 $ (positions 1 to 9):

[Figure: suffix tree of y]

Massage it into a suffix tree of the odd suffixes of x in linear time:

[Figure: suffix tree of the odd suffixes of x]

From this tree create the suffix tree for the even suffixes, also in linear time (again using radix sort):

[Figure: suffix tree of the even suffixes of x]

Now merge these two suffix trees into one, also in linear time.

The problem with suffix trees: too much memory! 5|x| to 10|x| machine words are required (Kurtz (1999): reduced suffix tree). The construction also requires a lot of additional (working) memory. This is infeasible and impractical for large strings (e.g. DNA: tens or hundreds of millions of letters).

Manber+Myers (1993) introduced suffix arrays as an alternative to suffix trees.

i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
x      a  b  c  a  a  b  c  a  b  a  c  c  a  b  a  a  c  b  $

i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
suff   4 15 13  8  1  5 16 10 18 14  9  2  6  3 12  7 17 11
lcp    2  1  3  2  4  1  2  0  1  2  1  3  0  2  4  1  1

[Figure: suffix tree of x, for comparison with the suffix array]
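The two arrays in the table can be reproduced with a naive sketch: sort the starting positions by the suffixes they denote, then compare neighbours letter by letter. This is quadratic in the worst case and is meant only to check the example; the function names are assumptions.

```python
def suffix_array(x):
    """1-indexed starting positions of the suffixes of x in lexicographic order."""
    return sorted(range(1, len(x) + 1), key=lambda p: x[p - 1:])

def lcp_array(x, suff):
    """lcp[i] = length of the longest common prefix of suffixes suff[i], suff[i+1]."""
    def lcp(p, q):
        k = 0
        while p + k <= len(x) and q + k <= len(x) and x[p + k - 1] == x[q + k - 1]:
            k += 1
        return k
    return [lcp(p, q) for p, q in zip(suff, suff[1:])]

x = "abcaabcabaccabaacb"
suff = suffix_array(x)
print(suff)                # [4, 15, 13, 8, 1, 5, 16, 10, 18, 14, 9, 2, 6, 3, 12, 7, 17, 11]
print(lcp_array(x, suff))  # [2, 1, 3, 2, 4, 1, 2, 0, 1, 2, 1, 3, 0, 2, 4, 1, 1]
```

Reading the leaves of the suffix tree from left to right gives the same order as suff, which is why the array can stand in for the tree.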

M+M: search for a substring u in O(|u| log n); construct suff in O(n log n), expected O(n); construct lcp in O(n log n), expected O(n).

Kasai et al. (2001): linear time algorithm to compute lcp from suff. The problem is thus reduced to suffix sorting.

Abouelhoda et al. (2002): search for u in O(|u|), with additional linear time preprocessing. Problems requiring a top-down or bottom-up traversal of the suffix tree can be solved with the same asymptotic complexity using suffix arrays.
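Two of these results are short enough to sketch. The first function follows the usual formulation of the lcp computation attributed to Kasai et al.; the second is a plain binary search over the suffix array, which corresponds to the O(|u| log n) search bound. The sketch uses 0-indexed positions internally, builds suff by naive sorting, and all function names are assumptions.

```python
def suffix_array(s):
    """Naive construction (0-indexed); stands in for a real suffix sorter."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def kasai_lcp(s, sa):
    """lcp[r] = common prefix length of suffixes sa[r] and sa[r+1], in O(n)."""
    n = len(s)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    lcp = [0] * max(n - 1, 0)
    h = 0
    for p in range(n):                       # visit suffixes in text order
        if rank[p] == 0:
            h = 0
            continue
        q = sa[rank[p] - 1]                  # predecessor of suffix p in sorted order
        while p + h < n and q + h < n and s[p + h] == s[q + h]:
            h += 1
        lcp[rank[p] - 1] = h
        if h > 0:
            h -= 1                           # the next suffix shares at least h-1 letters
    return lcp

def find_all(s, sa, u):
    """1-indexed occurrences of u, found by two binary searches over sa."""
    def bound(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = s[sa[mid]:sa[mid] + len(u)]
            if prefix < u or (strict and prefix == u):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return sorted(p + 1 for p in sa[bound(False):bound(True)])

x = "abcaabcabaccabaacb"
sa = suffix_array(x)
print(kasai_lcp(x, sa))        # [2, 1, 3, 2, 4, 1, 2, 0, 1, 2, 1, 3, 0, 2, 4, 1, 1]
print(find_all(x, sa, "cab"))  # [7, 12]
```

All suffixes that start with u form one contiguous block of the suffix array, so two boundary searches are enough to report every occurrence.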

Suffix sorting in linear time

Three papers came out in 2003 giving linear time recursive algorithms for suffix sorting. They all use Farach's approach: split the suffixes into G1 and G2, then
1. sort G1 using a recursive reduction of the problem
2. sort G2 using the order of G1
3. merge G1 and G2

Kärkkäinen+Sanders: the simplest, the most elegant, the most memory efficient. The question is: how fast?

i    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
x    a  b  c  a  a  b  c  a  b  a  c  c  a  b  a  a  c  b

Suppose we have all brown suffixes (those starting at positions 1, 4, 7, ... and 2, 5, 8, ...) sorted. Then the order of the gray suffixes 6 and 9 is determined by x[6] ~ x[9] or, if x[6] = x[9], by the order of the brown suffixes 7 and 10. Thus a radix sort with keys of size 2 will do.

So we have all gray suffixes sorted. How to merge the brown and gray suffixes?

i    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
x    a  b  c  a  a  b  c  a  b  a  c  c  a  b  a  a  c  b

Simple comparison-based merge: the order of a gray suffix and a brown suffix is determined by their first letters or, if these agree, by the order of two already sorted brown suffixes that follow them.

So, how to sort the brown suffixes? Like Farach!

Each brown position is named by the triple of letters starting there:

i    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
x    a  b  c  a  a  b  c  a  b  a  c  c  a  b  a  a  c  b
     C        A        H        E        B        D
        G        C        B        J        F        I

Using radix sort, sort the triples.

position   1  4  7 10 13 16  2  5  8 11 14 17
name       C  A  H  E  B  D  G  C  B  J  F  I

Sorted suffixes of the string of names, with the starting position in x:

A H E B D G C B J F I       4
B D G C B J F I            13
B J F I                     8
C A H E B D G C B J F I     1
C B J F I                   5
D G C B J F I              16
E B D G C B J F I          10
F I                        14
G C B J F I                 2
H E B D G C B J F I         7
I                          17
J F I                      11
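The three steps (sort the brown suffixes, sort the gray suffixes, merge) can be put together in a compact sketch. It is not the linear time algorithm itself: the recursive call is replaced by a plain sort of the reduced string, and `sorted` stands in for radix sort. Two further assumptions are labelled in the comments: the text contains no "\0" character, and an extra padding triple is added when n is a multiple of 3 so that the first group of names always ends with a unique name, which shifts the names by one relative to the letters A to J above. The function name is hypothetical.

```python
def suffix_sort_mod3(x):
    """Sort the suffixes of x (1-indexed) by the brown / gray / merge steps."""
    n = len(x)
    pad = x + "\0\0\0"                       # assumed: "\0" is smaller than every letter
    ch = lambda p: pad[p - 1]                # 1-indexed character access

    # Step 1: name the triples at brown positions (p mod 3 in {1, 2}).
    # Position n+1 joins the first group when n mod 3 == 0 (padding triple).
    sample = ([p for p in range(1, n + 2) if p % 3 == 1] +
              [p for p in range(1, n + 1) if p % 3 == 2])
    triple = {p: pad[p - 1:p + 2] for p in sample}
    name = {t: r for r, t in enumerate(sorted(set(triple.values())), start=1)}
    reduced = [name[triple[p]] for p in sample]
    # The real algorithm recurses on `reduced`; a plain sort stands in here.
    order = sorted(range(len(sample)), key=lambda k: reduced[k:])
    rank = {sample[k]: r for r, k in enumerate(order, start=1)}
    rk = lambda p: rank.get(p, 0)            # positions past the end rank lowest
    brown = [sample[k] for k in order if sample[k] <= n]

    # Step 2: sort gray positions by (first letter, rank of the next suffix).
    gray = sorted(range(3, n + 1, 3), key=lambda p: (ch(p), rk(p + 1)))

    # Step 3: merge, comparing one or two letters and then two brown ranks.
    def brown_first(s, t):
        if s % 3 == 1:
            return (ch(s), rk(s + 1)) < (ch(t), rk(t + 1))
        return (ch(s), ch(s + 1), rk(s + 2)) < (ch(t), ch(t + 1), rk(t + 2))

    merged, i, j = [], 0, 0
    while i < len(brown) and j < len(gray):
        if brown_first(brown[i], gray[j]):
            merged.append(brown[i]); i += 1
        else:
            merged.append(gray[j]); j += 1
    return merged + brown[i:] + gray[j:]

print(suffix_sort_mod3("abcaabcabaccabaacb"))
```

For the running example the result should agree with the array suff computed by direct sorting; the interest of the scheme is that every comparison in steps 2 and 3 costs constant time once the brown ranks are known.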

This approach works for any division as long as the beige blocks are bigger than the gray blocks. Of course, using bigger beige blocks requires a longer radix sort; however, it decreases the recursion depth and the memory use. In many ways, using blocks of size 3 optimizes the solution; see the results of a crude simulation:

N          split   total             rec   mem
100        2+1     1415.000000       8     943.333333
100        3+2     1541.866667       6     566.400000
1000       2+1     14890.000000      14    9926.666667
1000       3+2     16202.666667      10    5952.000000
10000      2+1     149855.000000     20    99903.333333
10000      3+2     163209.200000     15    59954.400000
100000     2+1     1499840.000000    25    999893.333333
100000     3+2     1633170.000000    19    599940.000000
1000000    2+1     14999770.000000   31    9999846.666667
1000000    3+2     16333150.400000   24    5999932.800000
10000000   2+1     14999770.000000   31    9999846.666667
10000000   3+2     16333150.400000   24    5999932.800000

Are the linear suffix sorting algorithms practical? They seem to require at least 4|x| working memory.

Larsson+Sadakane (1999): sorting suffixes as independent strings. For most real-world data this is very fast, though the worst-case complexity is Ω(n^2); it requires very little extra space (for instance bzip2 by Seward).

Manzini+Ferragina (2002): very fast, very little extra memory (0.03n); however, the worst-case complexity is Ω(n^2). They posed a problem: is there a lightweight (O(n log n) time, sublinear memory) algorithm?

Burkhardt+Kärkkäinen (2003): an O(n log n) suffix sorting algorithm with an O(n / log n) memory requirement. It is based on the idea of difference covers (VLSI design, distributed mutual exclusion; Colbourn+Ling (2000)).

For any pair of suffixes x[i..n], x[j..n], find the smallest k such that the order of x[i+k..n] and x[j+k..n] is known (anchor pair).

A difference cover D modulo v is a set of integers from 0..v-1 such that for any 0 < i < v there are i1, i2 ∈ D with i = i1 - i2 (mod v).

For i, j compute k = δ(i, j) ∈ [0, v) so that ((i+k) mod v) and ((j+k) mod v) are both in D (this can be done in O(v)).
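Both definitions translate directly into code. The sketch below checks the difference cover property by brute force and computes δ(i, j) with the O(v) scan mentioned above; the function names and the example cover {0, 1, 3} modulo 7 are illustrative assumptions.

```python
def is_difference_cover(D, v):
    """True if every 0 < i < v equals (i1 - i2) mod v for some i1, i2 in D."""
    diffs = {(a - b) % v for a in D for b in D}
    return all(i in diffs for i in range(1, v))

def delta(i, j, D, v):
    """Smallest k in [0, v) with (i+k) mod v and (j+k) mod v both in D."""
    for k in range(v):
        if (i + k) % v in D and (j + k) % v in D:
            return k
    raise ValueError("D is not a difference cover modulo v")

print(is_difference_cover({1, 2}, 3))      # True
print(is_difference_cover({0, 1, 3}, 7))   # True
print(delta(6, 9, {1, 2}, 3))              # 1
```

The last line mirrors the earlier example with suffixes 6 and 9: shifting both by k = 1 lands on positions 7 and 10, whose residues modulo 3 are in D = {1, 2}.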

Then sort all suffixes whose starting position is in D. The sort of all suffixes is thereby transformed into a sort on keys of length v+1. Note that the Kärkkäinen+Sanders algorithm uses D modulo 3!

Colbourn+Ling (2000): for every v, a difference cover D modulo v of size |D| ≤ √(1.5v) + 6 can be computed in O(√v) time.

Currently the fastest linear suffix sorting algorithm is due to Maniscalco+Puglisi (2008).

Can a suffix array really replace the string? Bannai, Inenaga, Shinohara, and Takeda (2003): given an array, it can be checked whether it is the suffix array of some string (it must be a permutation of 1..n), and such a string over a minimal alphabet can be inferred from the array in O(n) time.
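The inference can be sketched as follows. This is a reconstruction under a stated assumption rather than the authors' exact procedure: it relies on the characterisation that two suffixes adjacent in the array may start with the same letter only if the suffixes following their first letters appear in the same relative order. The function name `infer_string` and the use of the letters a, b, c, ... are illustrative.

```python
def infer_string(suff):
    """Build a string over a minimal alphabet whose suffix array (1-indexed) is suff."""
    n = len(suff)
    if sorted(suff) != list(range(1, n + 1)):
        raise ValueError("not a permutation of 1..n")
    rank = {p: r for r, p in enumerate(suff, start=1)}
    rank[n + 1] = 0                           # the empty suffix precedes all others
    letters = [None] * (n + 1)
    code = 0
    for r, p in enumerate(suff):
        # A new, larger letter is needed only when the order of the
        # remaining suffixes would otherwise contradict the array.
        if r > 0 and not rank[suff[r - 1] + 1] < rank[p + 1]:
            code += 1
        letters[p] = chr(ord("a") + code)
    return "".join(letters[1:])

suff = [4, 15, 13, 8, 1, 5, 16, 10, 18, 14, 9, 2, 6, 3, 12, 7, 17, 11]
print(infer_string(suff))   # abcaabcabaccabaacb
```

For the running example the inferred string is x itself, since x already uses the smallest alphabet consistent with its suffix array.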