BUNDLED SUFFIX TREES


1 BUNDLED SUFFIX TREES. Luca Bortolussi (1), Francesco Fabris (2), Alberto Policriti (1). (1) Department of Mathematics and Computer Science, University of Udine. (2) Department of Mathematics and Computer Science, University of Trieste. BITS 2005, Milano, 18th-19th March 2005.

2 Outline. 1. Motivation: Suffix Trees. 2. Non-Transitive Relations; Definition; Size and Construction. 3. 4.

3 Introduction. Since the discovery of DNA, biology has given rise to many hard string problems. An important challenge: find repeated patterns in DNA that are biologically significant. One feature: patterns are repeated with errors (approximate pattern discovery is difficult). Another, more difficult feature: formalizing what "biologically significant" means.

4 Suffix Trees. Example string: bcabbabc. A suffix tree is a data structure that exploits the internal structure of a string. It is efficient for: the Exact String Matching Problem; the Longest Exact Common Substring Problem; identifying exactly repeated patterns.

5 Suffix Trees. Suffix trees are linear in size (w.r.t. text length) and can be built in linear time. E. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2). E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14, 1995.

6 Suffix Trees. Suffix trees are not a natural fit for approximate string matching problems (positive Hamming or edit distance). G.M. Landau and U. Vishkin. Efficient String Matching with k Mismatches. Theoretical Computer Science, 43. D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, 1997.

7 Suffix Trees. The Longest Common Approximate Substring Problem and the extraction of approximate repeated patterns cannot be solved as in the exact case.

8 Extending Suffix Trees. THE PROJECT: exploring the possibility of using different tree-based structures to tackle approximate string matching problems. SO FAR: we developed Bundled Suffix Trees (BuST), an extension of suffix trees such that: they incorporate information about errors; they can be used, much as suffix trees are in the exact case, for the Longest Common Approximate Substring Problem and for extracting approximate repeated patterns.

10 Non-Transitive Relation. Character matching is a relation among letters (in fact, it is the equality relation). Approximate matching can also be modeled as a relation among letters, one that is non-transitive and larger than equality: two strings match if all their letters, position by position, are in relation.

12 Non-Transitive Relation: An Example. Modelling a relation based on Hamming distance: start from a basic alphabet (e.g. binary: {0, 1}); construct an alphabet composed of macrocharacters (e.g. $A = \{00, 01, 10, 11\}$); impose that two letters $x, y \in A$ are in relation if and only if $d_H(x, y) \le 1$. The relation is non-transitive: 00 is in relation with 01 and 01 with 11, but $d_H(00, 11) = 2$. Figure: the relation graph.
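The relation graph for this example can be rebuilt with a few lines of Python. This is only an illustrative sketch: the function names `hamming` and `relation_graph` are hypothetical and do not come from the slides, and the threshold "Hamming distance at most 1" is the reading of the definition assumed here.

```python
from itertools import product

def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def relation_graph(base, k, max_dist=1):
    """Map each macrocharacter of length k to the other macrocharacters in relation with it."""
    letters = ["".join(p) for p in product(base, repeat=k)]
    return {
        x: [y for y in letters if x != y and hamming(x, y) <= max_dist]
        for x in letters
    }

graph = relation_graph("01", 2)
for letter, neighbours in graph.items():
    print(letter, "->", neighbours)

# Non-transitivity: 00 ~ 01 and 01 ~ 11, yet 00 and 11 are not in relation.
print("11" in graph["00"])  # False
```

Every letter is also in relation with itself (distance 0); the graph omits these self-loops for readability.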

14 Bundled Suffix Tree: An Example. We start from the suffix tree for the string bcabbabc. The alphabet is {a, b, c}, and the relation is $a \sim b \sim c$: a is in relation with b and b with c, but a is not in relation with c.

15 Bundled Suffix Tree: An Example. Let's compare suffix 3 (abbabc) and suffix 1 (bcabbabc). According to our relation, the maximal prefix of suffix 3 that is in relation with a prefix of suffix 1 is abbab (matched against bcabb; the next letters, c and a, are not in relation).

16 Bundled Suffix Tree: An Example. Therefore, after bcabb, we put in the tree a red node with label 3. By symmetry, there is also a red node with label 1 after abbab.
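The comparison of suffix 3 with suffix 1 can be checked mechanically. The sketch below assumes the chain relation a ~ b ~ c (reflexive and symmetric, with a and c not in relation); the names `related` and `related_prefix_length` are hypothetical helpers introduced for illustration.

```python
# Pairs in relation besides equality; symmetry is handled in related().
CHAIN = {("a", "b"), ("b", "c")}

def related(x, y):
    """True if letters x and y are in relation under the chain a ~ b ~ c."""
    return x == y or (x, y) in CHAIN or (y, x) in CHAIN

def related_prefix_length(u, v, rel=related):
    """Length of the longest prefix of u that is letter-by-letter in relation with a prefix of v."""
    n = 0
    for x, y in zip(u, v):
        if not rel(x, y):
            break
        n += 1
    return n

text = "bcabbabc"
suffix1, suffix3 = text[0:], text[2:]
k = related_prefix_length(suffix3, suffix1)
print(suffix3[:k], suffix1[:k])  # abbab bcabb
```

The two printed strings are exactly the paths after which the red nodes labelled 1 and 3 are placed in the example.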

18 Bundled Suffix Tree: An Example. If we repeat this process for every pair of suffixes, we build a Bundled Suffix Tree! Note that this data structure lies midway between a suffix tree and a suffix trie.

20 Bundled Suffix Tree: An Example. This tree can be used to solve the Longest Common Approximate Substring Problem with respect to a given relation: we just have to find the lowest (deepest) red node! Similarly, we can also extract information about approximate repeated patterns.
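For comparison, the same problem can be stated without any tree at all. The following sketch is a plain quadratic dynamic-programming baseline, not the BuST method from the slides: it finds the longest pair of equal-length substrings of two strings whose letters are pairwise in relation. The function name and the chain relation used in the usage lines are assumptions made for illustration.

```python
def longest_common_related_substring(s, t, related):
    """Return (u, v): longest substrings of s and t with len(u) == len(v)
    and every aligned pair of letters in relation."""
    best_len, best_s, best_t = 0, 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if related(s[i - 1], t[j - 1]):
                # Extend the related run ending at (i - 1, j - 1).
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_s, best_t = cur[j], i - cur[j], j - cur[j]
        prev = cur
    return s[best_s:best_s + best_len], t[best_t:best_t + best_len]

def chain(x, y):
    """Chain relation a ~ b ~ c, assumed reflexive and symmetric."""
    return x == y or {x, y} in ({"a", "b"}, {"b", "c"})

print(longest_common_related_substring("aaa", "bbb", chain))  # ('aaa', 'bbb')
print(longest_common_related_substring("aa", "cc", chain))    # ('', '')
```

With the equality relation this reduces to the classical longest common substring computation.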

21 How Big? The number of red nodes inserted depends on the relation and on the structure of the text. In the worst case, the number of red nodes is quadratic in the length of the text S. On average, the number of red nodes is bounded by $m^{1+\delta}$, with $\delta = \log_{1/p_+} C$, where $m$ is the length of the text, $p_+$ is the frequency of the most common letter in S, and $C$ depends on the relation. $1 + \delta$ is slightly greater than one!

23 How Fast? Naive Algorithm. The naive algorithm for building a BuST simply tries to match every suffix of the text along every branch of the suffix tree, until a mismatch is found. It can be quadratic in the worst case. However, an analysis based on the average shape of a suffix tree shows that its average complexity is bounded by $m^{1+\delta'}$ (with $\delta'$ just slightly greater than $\delta$). W. Szpankowski. A Generalized Suffix Tree and its (Un)expected Asymptotic Behaviors. SIAM J. Comput. 22(6), 1993. P. Jacquet, B. McVey, W. Szpankowski. Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of Depth. Journal of the Iranian Statistical Society, 3, 2004.
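The naive idea can be sketched on an uncompressed suffix trie instead of a compact suffix tree, which keeps the code short. This is a simplification for small inputs, not the authors' implementation. Two assumptions are made: a red position is recorded where a path in relation with a prefix of the suffix cannot be extended any further, and paths that coincide exactly with the suffix itself are skipped. The names `build_suffix_trie` and `red_positions` are hypothetical.

```python
def related(x, y):
    """Chain relation a ~ b ~ c, assumed reflexive and symmetric."""
    return x == y or {x, y} in ({"a", "b"}, {"b", "c"})

def build_suffix_trie(s):
    """Uncompressed suffix trie as nested dictionaries."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def red_positions(s, rel=related):
    """Set of (path, label): maximal trie paths in relation with a prefix of suffix `label` (1-based)."""
    trie = build_suffix_trie(s)
    found = set()
    for i in range(len(s)):
        suffix = s[i:]
        stack = [(trie, "")]
        while stack:
            node, path = stack.pop()
            depth = len(path)
            # Children whose letter is in relation with the next letter of the suffix.
            nxt = [(ch, child) for ch, child in node.items()
                   if depth < len(suffix) and rel(suffix[depth], ch)]
            if nxt:
                for ch, child in nxt:
                    stack.append((child, path + ch))
            elif path and path != suffix[:depth]:
                found.add((path, i + 1))
    return found

reds = red_positions("bcabbabc")
print(("bcabb", 3) in reds)  # True
print(("abbab", 1) in reds)  # True
```

Both red nodes from the worked example are recovered; the trie makes the cost of this sketch quadratic in space, so it is only meant to illustrate the matching step.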

24 Faster. Efficient Algorithm. We found a McCreight-like algorithm that is linear in the size of the output. Intuitions: it processes the suffixes backwards; it is based on the concept of inverse suffix links; it identifies the red nodes for suffix i by processing the red nodes for suffix i + 1.

25 Hunting TFBS. We are using BuST to identify candidate transcription factor binding sites (TFBS) in DNA sequences. The algorithm first constructs the BuST for the set of sequences under analysis, and then extracts and combines the information contained in it. The relation used is defined by a Hamming distance criterion. The algorithm is quite fast: for instance, we are able to solve the benchmark proposed by Pevzner et al. in a few seconds. P.A. Pevzner and S.H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc Int Conf Intell Syst Mol Biol. 2000;8.

26 Hunting TFBS. We are using BuST to identify TFBS candidates in DNA sequences. The general concept of non-transitive relation seems very fruitful: it can be used to encode Hamming distance, but also to tackle edit distance or to encode other biologically-driven relations. G. Pavesi, G. Mauri and G. Pesole. In silico representation and discovery of transcription factor binding sites. Briefings in Bioinformatics, 5(3):1-20, 2004.

27 Conclusions. We have introduced BuST, a new data structure extending suffix trees. It can be used to extract approximate information from a string, and it is manipulated similarly to suffix trees. The structure is based on a very general concept of non-transitive relation among (macro)characters. Its size is slightly more than linear on average, and there is a fast (McCreight-like) algorithm to build it. It can be used to discover approximate patterns in a text. For instance, it can be used to identify candidates for TFBS.

28 Tests. We have implemented the naive algorithm for the construction of BuST. We have tested it with relations induced by Hamming distance, defined over DNA macrocharacters. With macrocharacters of size 4, such that two of them are in relation iff their Hamming distance is at most 1, the algorithm is quite fast and can process texts of 100 Kb in a few seconds. The number of red nodes grows with an exponent smaller than the predicted one.

29 Quadratic BuST. Let's consider the text $\underbrace{a \ldots a}_{m}\,\underbrace{b \ldots b}_{m}\,\underbrace{c \ldots c}_{2m}$ over {a, b, c, d}, with $a \sim b \sim d \sim c$. The number of nodes surrounded by the red box (in the slide's figure) is quadratic in m!

32 The exponent δ. (Figure not reproduced.)

34 Test. Macrocharacters of length 4 over the DNA alphabet. Test strings are generated according to a uniform probability distribution. (Figure not reproduced.)

35 Inverse Suffix Links. Suffix links play a crucial role in the fast construction of suffix trees. They are pointers from nodes with path label xα to nodes with path label α. Whenever there is a node with path label xα, there is also a node with path label α.

36 Inverse Suffix Links. Inverse suffix links (ISLs) are pointers from nodes with path label α to positions in the tree labeled xα, for each x in the alphabet such that xα is a substring of S. They can point to the middle of an arc. If an ISL leads from α to xα, it is labeled with x.
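To make the definition concrete, the labels of the inverse suffix links leaving a position α can be listed by brute force: they are the letters x such that xα occurs in S. The sketch below scans the text directly rather than storing pointers in a tree, so it only illustrates which links exist; the function name `inverse_suffix_link_labels` is hypothetical.

```python
def inverse_suffix_link_labels(s, alpha):
    """Sorted letters x such that x + alpha is a substring of s."""
    labels = set()
    start = s.find(alpha)
    while start != -1:
        if start > 0:
            # The letter preceding this occurrence of alpha labels an ISL.
            labels.add(s[start - 1])
        start = s.find(alpha, start + 1)
    return sorted(labels)

text = "bcabbabc"
print(inverse_suffix_link_labels(text, "ab"))  # ['b', 'c']
print(inverse_suffix_link_labels(text, "bc"))  # ['a']
```

In the running example, "ab" is preceded by c (in "cab") and by b (in "bab"), so two inverse suffix links leave the position labeled ab.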

37 The Algorithm. Inverse suffix links can be used to identify the red nodes for suffix S[i] from the red nodes for suffix S[i + 1]. Suppose we know the location of a red node for suffix S[i + 1], and that it is just under a black node with path label α.

38 The Algorithm. From this node, we can cross all inverse suffix links such that S[i] is in relation with the character labeling the ISL. With a skip-and-count trick, we can identify the positions of red nodes for S[i].

40 A First Application: The Benchmark. There is a set of 20 strings of length 1000, generated according to a uniform distribution over the DNA alphabet. There is a pattern p of length 16, such that 20 of its occurrences are implanted in the strings, with 4 mutations occurring in random positions. The problem is to identify p (the signal), given the strings. P.A. Pevzner and S.H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc Int Conf Intell Syst Mol Biol. 2000;8.
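A data set of this shape is easy to generate for experiments. The sketch below is an assumption-laden reconstruction, not the original benchmark generator: it implants exactly one mutated occurrence per string (the slide only says that 20 occurrences are implanted in the 20 strings), and the function name `make_benchmark` and its parameters are hypothetical.

```python
import random

def make_benchmark(n_strings=20, length=1000, pattern_len=16, mutations=4, seed=0):
    """Return (pattern, [(string, start), ...]) with one mutated copy of pattern per string."""
    rng = random.Random(seed)
    dna = "ACGT"
    pattern = "".join(rng.choice(dna) for _ in range(pattern_len))
    strings = []
    for _ in range(n_strings):
        background = [rng.choice(dna) for _ in range(length)]
        occurrence = list(pattern)
        # Mutate exactly `mutations` distinct positions to a different letter.
        for pos in rng.sample(range(pattern_len), mutations):
            occurrence[pos] = rng.choice([c for c in dna if c != occurrence[pos]])
        start = rng.randrange(length - pattern_len + 1)
        background[start:start + pattern_len] = occurrence
        strings.append(("".join(background), start))
    return pattern, strings

pattern, strings = make_benchmark()
print(pattern, len(strings), len(strings[0][0]))
```

The implant positions are returned only so that a solver's output can be checked; a solver should of course see the strings alone.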

41 A First Application: A Solution with BuST. We used macrocharacters of length 4 (two of them are in relation if their Hamming distance is at most 1). We built the generalized BuST for the strings (converted into macrocharacters in every possible way). For every substring of length 16 of the 20 strings, we looked at the set of substrings in relation with it, and we combined this information to find p. It's a naive use of BuST, but it works!
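The conversion step can be sketched as follows. The reading of "in every possible way" as "one conversion per starting offset 0..3" is an assumption, as is the threshold "Hamming distance at most 1"; the helper names `to_macrocharacters`, `macro_related` and `strings_related` are hypothetical.

```python
def to_macrocharacters(s, k=4):
    """One list of non-overlapping length-k macrocharacters per starting offset 0..k-1."""
    return [
        [s[i:i + k] for i in range(offset, len(s) - k + 1, k)]
        for offset in range(k)
    ]

def macro_related(x, y, max_dist=1):
    """Two macrocharacters are in relation if they differ in at most max_dist positions."""
    return sum(a != b for a, b in zip(x, y)) <= max_dist

def strings_related(u, v):
    """Two equal-length macrocharacter sequences match if all aligned macrocharacters are in relation."""
    return len(u) == len(v) and all(macro_related(x, y) for x, y in zip(u, v))

frames = to_macrocharacters("ACGTACGTAC")
print(frames[0])  # ['ACGT', 'ACGT']
print(frames[3])  # ['TACG']
print(strings_related(["ACGT", "ACGT"], ["ACGA", "TCGT"]))  # True
```

Under these assumptions, a substring of length 16 corresponds to four consecutive macrocharacters, and two such substrings are in relation when each of the four aligned blocks differs in at most one position.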


Inexact Matching, Alignment. See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming) Inexact Matching, Alignment See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming) Outline Yet more applications of generalized suffix trees, when combined with a least common ancestor

More information

11/5/09 Comp 590/Comp Fall

11/5/09 Comp 590/Comp Fall 11/5/09 Comp 590/Comp 790-90 Fall 2009 1 Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary secrets Many tumors

More information

Assignment 2 (programming): Problem Description

Assignment 2 (programming): Problem Description CS2210b Data Structures and Algorithms Due: Monday, February 14th Assignment 2 (programming): Problem Description 1 Overview The purpose of this assignment is for students to practice on hashing techniques

More information

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha Motivation Sequence homology to a known protein suggest function of newly sequenced protein Bioinformatics

More information
