A Suffix Tree Construction Algorithm for DNA Sequences


Hongwei Huo
School of Computer Science and Technology, Xidian University, Xi'an, China

Vojislav Stojkovic
Computer Science Department, Morgan State University, Baltimore, MD 21251, USA

Abstract

The suffix tree is a powerful data structure in string processing and DNA sequence comparison. However, suffix tree construction is very greedy in space, which is a serious drawback. In addition, the performance of suffix tree construction using suffix links degrades rapidly as the sequences to be handled grow, because of random memory access. To overcome these disadvantages, a new bit layout with lower space requirements is used for the nodes of a suffix tree. Based on this layout, an algorithm that constructs suffix trees for DNA sequences using partitioning strategies is proposed. The effectiveness of the proposed algorithm is shown on test cases from the NCBI web site. The experiments compare it with Kurtz's algorithm in space requirements and running time. The results show that the proposed algorithm is memory-efficient and outperforms Kurtz's algorithm in average running time.

1. Introduction

The suffix tree is one of the most fundamental and important data structures for processing DNA sequences in large amounts of genetic and biochemical data. Suffix trees provide efficient access to all substrings of a string, and they can be constructed and represented in linear time and space. A suffix tree exposes the internal structure of a string in a deeper way. Suffix trees can be used to solve the exact matching problem in linear time, with the same worst-case bound as KMP[1] and Boyer-Moore[2], but they are more practical, as suffix trees for large texts, e.g. complete genomes with 3×10^9 base pairs, have been proved to be manageable[3].
Suffix trees can also solve substring problems with O(m) preprocessing time and O(n) search time for an input sequence of length m and a pattern of length n. The KMP and Boyer-Moore methods cannot achieve this bound. Suffix trees are useful not only for substring problems but also for complex repeat-finding problems. For example, MUMmer[4, 5] is a genome alignment system that uses suffix trees as its main structure to align two closely related genomes. Owing to the advantages of suffix trees, MUMmer provides a faster, simpler, and more systematic way to solve hard problems.

Although suffix trees have these superior features, they are not widely used in actual string processing software, because the space consumption of a suffix tree is still quite large despite being asymptotically linear[9]. As a consequence, several alternative index structures have been developed that store less information than suffix trees and are more space efficient[6]: the suffix array, the level-compressed trie, the suffix binary search tree, and the suffix cactus[8]. These index structures have to be tailored to particular string matching problems and cannot be adapted to other kinds of problems without loss of performance. Traditional string methods also cannot be applied directly to DNA sequences, which are too complex for them to handle. Thus reducing the space requirement of suffix trees is still an important problem in genome processing.

The suffix tree was proposed by Weiner[7], and many improvements have been made over the following decades. Early work on suffix tree construction focused on algorithms that run in linear space. These algorithms suit small inputs for which the whole tree can be built in memory. However, they are less space efficient, because they suffer from poor locality of memory reference on cached processor architectures and are difficult to store in secondary memory.
Once the data are too large to be loaded into memory, construction leads to many cache misses and much disk swapping. Thus, developing a practical algorithm for suffix tree construction is still an important problem.

To overcome these disadvantages, a new bit layout with lower space requirements is used for the nodes of a suffix tree. Based on this layout, an algorithm that constructs suffix trees for DNA sequences is proposed; it partitions the suffixes according to their common prefixes so that independent subtrees can be built. The experiments show that the proposed algorithm is memory-efficient and has better average running time.

2. Preliminary

A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node other than the root has at least two children, and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i.

/07/$ IEEE 1178

Suffix trees can be constructed in linear time and space by several algorithms[6-9]. Some of these algorithms achieve O(n) construction time with the help of suffix links, where a suffix link is a link from one internal node to another. Fig. 1 is an example of the suffix tree for the string ATTAGTACA, where the dashed line is a suffix link.

Fig. 1 The suffix tree of the string 'ATTAGTACA$'

3. Approach

3.1. Analysis

Modern processors usually use one or more caches to speed up access to memory, which helps when memory accesses have good temporal or spatial locality. Because suffix links run throughout the suffix tree, linear-time construction algorithms such as Ukkonen's[9] and McCreight's[8] require many random memory accesses. In Ukkonen's algorithm, cache misses happen when the algorithm follows a suffix link to reach another subtree and check its child nodes. Such a traversal causes random accesses at very distant locations in memory. Each access is also more likely to go to main memory, because the span of the address space is too large to fit into the cache.

Kurtz's algorithm optimizes the space requirements of McCreight's algorithm. It divides the internal nodes into large nodes and small nodes, based on the relation of their head position values, to store the suffix tree information. During the construction of internal nodes, there are many short or long small-large chains, each a sequence of small nodes followed by one large node. In a small-large chain, the values of headposition, depth and suffixlink of all small nodes can be derived from the large node at the end of the chain. Therefore, with the bit optimization technique, Kurtz's algorithm uses four integers for one large node, two integers for one small node and one integer for each leaf node. So the longer the small-large chain is, the more space is saved.

Our analysis shows that a small-large chain is formed only if all of the nodes in the chain are new nodes created consecutively while a series of suffixes is added to the suffix tree one by one. However, a DNA sequence is well known for its repetitive structure and is also drawn from a small alphabet, which makes repetition highly likely. Therefore, using Kurtz's method on DNA sequences may not benefit from small nodes but instead produces more large nodes.

3.2. Algorithm

From Section 3.1 we can conclude that if the record that keeps the internal node information can be reduced to just three integers, then memory space can be saved. Furthermore, suffix-link based algorithms are not suitable when the input data are very large, so discarding the suffix link may be an ideal choice. Thus we use a three-integer bit layout for each internal node record[3].

By the properties of a suffix tree, if we group in advance the suffixes that belong to one branching node of the root, we can merge the common prefixes of these suffixes step by step during top-down construction of the suffix tree, generating the internal branching nodes, with the common prefix as edge-label, and the corresponding leaf nodes, until all branching nodes under that branch are constructed. With partition techniques, a new algorithm ST-PTD (Suffix Tree Partition and Top-Down) for the construction of suffix trees is proposed.
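The top-down idea described above (group the suffixes that start with the same character, split off their longest common prefix as an edge-label, then recurse) can be illustrated with a small Python sketch. This is only an illustration under simplifying assumptions, not the ST-PTD implementation: it ignores the three-integer bit layout, uses recursion and dictionaries instead of arrays and an explicit stack, numbers suffixes from 0 rather than 1, and assumes the text ends with a unique terminator `$`. The names `build_top_down` and `leaf_paths` are hypothetical.

```python
def build_top_down(text, positions, depth=0):
    """Build the subtree for the suffixes text[p:] with p in `positions`.

    All these suffixes are assumed to share their first `depth` characters.
    A leaf is returned as the (0-based) suffix start position; an internal
    node is returned as a dict mapping edge-label -> child.
    """
    if len(positions) == 1:
        return positions[0]
    # Group the suffixes by the next character after the shared prefix.
    groups = {}
    for p in positions:
        groups.setdefault(text[p + depth], []).append(p)
    node = {}
    for group in groups.values():
        first = group[0]
        if len(group) == 1:
            # A single suffix becomes a leaf; its edge carries the rest of it.
            node[text[first + depth:]] = first
            continue
        # Extend the common prefix while all suffixes in the group agree.
        # The unique terminator guarantees that this loop stops in range.
        lcp = 1
        while len({text[p + depth + lcp] for p in group}) == 1:
            lcp += 1
        label = text[first + depth:first + depth + lcp]
        node[label] = build_top_down(text, group, depth + lcp)
    return node


def leaf_paths(node, prefix=""):
    """Yield (leaf position, concatenated edge-labels from the root)."""
    if isinstance(node, int):
        yield node, prefix
    else:
        for label, child in node.items():
            yield from leaf_paths(child, prefix + label)


text = "ATTAGTACA$"
tree = build_top_down(text, list(range(len(text))))
# Every root-to-leaf path must spell out the suffix that starts at that leaf.
assert all(path == text[i:] for i, path in leaf_paths(tree))
```

Like the approach it illustrates, this sketch takes quadratic time in the worst case, because long common prefixes are rescanned character by character.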
Because of the partition, larger inputs can be handled, and the construction of each subtree in memory is independent. Fig. 2 shows the algorithm ST-PTD. It uses four data structures for the construction of the suffix tree: an array String that stores the input string, an array Suffixes that stores the partitions, a temporary working space Temp for counting-sort, and the suffix tree itself.

Algorithm ST-PTD (String, prefixlen)
Phase 1: Preprocessing
    Scan the String and partition Suffixes based on the first prefixlen symbols of each suffix
Phase 2: Do for each partition
    Construct suffix tree
    for each partition Pi do
        R ← Pi
        do
            S ← counting-sort(R, Temp)
            if |S| = 1 then
                create a leaf l
                Tree ← Tree ∪ {l}
            else
                for each R ∈ S do
                    if |R| = 1 then
                        create node n and leaf l
                        Tree ← Tree ∪ {l, n}
                    else
                        Push(R)
            if Stack is not empty then
                R ← Pop
                l = finding-lcp(R)
                for each suffix-index ∈ R do
                    suffix-index ← suffix-index + l
        while Stack is not empty
    Merge

The ST-PTD algorithm consists of two phases: partition and subtree construction. We divide the suffixes of the input string into |A|^prefixlen parts, where |A| is the alphabet size of the string and prefixlen is the depth of partitioning. The partition procedure is as follows. First we scan the input string from left to right. At each index position i, the prefixlen subsequent characters determine one of the |A|^prefixlen partitions, and the index i is recorded in that partition's buffer. At the end of the scan, each partition contains the suffix pointers of the suffixes that share the same prefix of length prefixlen.

For DNA sequences, assuming that the internal nodes close to the root are dense because the sequences are highly repetitive and have a small alphabet, we can take prefixlen to be log_4(seq_length) - 1. However, when prefixlen is larger than 7, the partition phase becomes costly for large datasets such as genomes and brings no obvious advantage to the algorithm, so we take prefixlen to be (log_4(seq_length) - 1)/

3.3. Time and space complexity

The execution time of the ST-PTD algorithm is O(n^2) in the worst case. The suffix tree can be represented with n + 3a integers for a sequence of length n, where a is the number of internal nodes. Thus each character requires (4n + 12a)/n bytes on average on a 32-bit computer. The ratio a/n is about 0.66 for the DNA sequences used in the experiments. Therefore each character in the sequence requires about 4 + 12 × 0.66 ≈ 11.9 bytes on average.
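The partition phase described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C implementation: it keys the partitions by the prefix string in a dictionary instead of computing one of the |A|^prefixlen buffer indices, and the function name `partition_suffixes` is hypothetical. Suffixes shorter than prefixlen, at the very end of the text, simply end up in partitions of their own.

```python
from collections import defaultdict


def partition_suffixes(text, prefixlen):
    """Group suffix start positions by their first `prefixlen` characters.

    The text is scanned once from left to right, so the positions inside
    each partition are recorded in increasing order.
    """
    partitions = defaultdict(list)
    for i in range(len(text)):
        partitions[text[i:i + prefixlen]].append(i)
    return dict(partitions)


parts = partition_suffixes("ATTAGTACA$", 2)
# The two suffixes starting with "TA" (positions 2 and 5) share a partition.
assert parts["TA"] == [2, 5]
```

Each partition holds only suffixes with a common prefix, so a subtree can be built for one partition at a time without touching the others, which is what keeps the working set small.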
4. Experimental results and analysis

4.1 Space requirements

We use DNA sequences from the NCBI web site to compare the space requirement of ST-PTD with that of Kurtz's algorithm[6]. The numbers given in the table refer only to the space required for construction and do not include the n bytes used to store the input string.

Table 1 The space requirement of Kurtz's algorithm and ST-PTD
Columns: Sequence, Length, Kurtz's algo, ST-PTD
Rows: AC, AC, BC, J, M, M, M, V, X, ecoli, [Average]

Table 1 shows the space requirements for each sequence. The space requirement is defined as the average number of bytes used per character. The first column gives the names of the DNA sequences and the second their lengths. The third and fourth give the space requirements of Kurtz's algorithm and ST-PTD, respectively. Compared with Kurtz's method, ST-PTD saves space. Space needs show no relationship to sequence length. However, the sequence structure, as in J03071, has a great effect on the space demand.

4.2 Running time

Both algorithms, Kurtz's method and ST-PTD, were implemented for the experiments. The programs were written in C and compiled with GCC. To demonstrate the impact of memory on the algorithms, the programs were run on two different platforms, which we call config1 and config2. Config1 is an Intel Pentium 4.3GHZ with 512M RAM running Red Hat Linux 9, and config2 is an Intel Pentium III 1.3 GHZ with 128M RAM running Fedora 4. The experimental results are shown in Table 2. The running time is in seconds, and throughput is the ratio of the time multiplied by 10^6 to the sequence length. The dark shaded areas show the better throughput. '-' indicates a running time of more than 1 hour.

Table 2 The running time and throughput of Kurtz's algorithm and ST-PTD
Columns: Sequence, Length, then time and tput for Kurtz's algo and for ST-PTD under Config 1, and the same four columns under Config 2
Rows: J; V; AC; M; M; AC; X; B_anthracis_Mslice; H.sapiens chr.10 slicel; H.sapiens chr.10 slice; H.sapiens chr.10 slice; H.sapiens chr.10 slice; ecoli; H.sapiens chr.10 slice; H.sapiens chr.10 slice; H.sapiens chr.10 slice; influenza slice; H.sapiens chr.10 slice; H.sapiens chr.10 slice; H.sapiens chr.10 slicelo; H.sapiens chr.10 slicell; H.sapiens chr.10 slicel; Arabidopsis thaliana chr. 4; H. sapiens chr slicel 3; [Average]

The main data structures used in both algorithms are arrays, because arrays are more time-efficient. Arrays also limit the size of the data that can be handled. We nevertheless use arrays, because the implementation of Kurtz's algorithm that uses linked lists takes about 20 minutes for the sequence B_anthracis_Mslice of length 317k and over four hours for the sequence ecoli of length 4.6M.

The table shows the following. Although ST-PTD has a worst-case running time of O(n^2) and Kurtz's algorithm a worst-case running time of O(n), ST-PTD is slightly faster than Kurtz's algorithm in average running time. This shows that locality of memory reference has a great influence on running time. The partition strategy and the sequence structure also affect performance. For example, the difference caused by unbalanced partitions on the sequence influenza slice is obvious. ST-PTD has a greater advantage over Kurtz's algorithm on the lower configuration because of its partition phase, which reduces the size of the problems being processed so that larger data can be handled.
Comparing the running times of the two algorithms on the two configurations, we can see that memory is still one of the bottlenecks affecting performance, because the suffix tree is indeed very greedy for space. In addition, compared with Kurtz's algorithm, ST-PTD is easier to understand and implement. ST-PTD is also easier to parallelize, because the construction of each subtree is independent.

5. Conclusions

A new suffix tree construction algorithm is presented in this paper. Instead of using small-large chains and suffix links in the construction of the suffix tree, we use a new bit layout. Based on this layout, we explore an algorithm that constructs suffix trees for DNA sequences using a partitioning strategy based on common prefixes, which allows independent subtrees to be built in memory. ST-PTD is cache-efficient despite its O(n^2) worst-case complexity, and the experiments show that it has better average running time.

References

[1] D. E. Knuth, J. H. Morris, and V. B. Pratt, "Fast pattern matching in strings," SIAM Journal on Computing, 1977, Vol. 6, pp
[2] R. S. Boyer and J. S. Moore, "A fast string searching algorithm," Communications of the ACM, 1977, Vol. 20, pp
[3] Yun-Ching Chen and Suh-Yin Lee, "Parsimony-spaced suffix trees for DNA sequences," ISMSE'03, Nov,
[4] Arthur L. Delcher, Simon Kasif, Robert D. Fleischmann, Jeremy Peterson, Owen White, and Steven L. Salzberg, "Alignment of whole genomes," Nucleic Acids Research, 1999, Vol. 27, pp
[5] Arthur L. Delcher, Adam Phillippy, Jane Carlton, and Steven L. Salzberg, "Fast algorithms for large-scale genome alignment and comparison," Nucleic Acids Research, 2002, Vol. 30, pp
[6] S. Kurtz, "Reducing the space requirement of suffix trees," Software Pract. Experience, 1999, Vol. 29, pp
[7] P. Weiner, "Linear pattern matching algorithms," Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, 1973, pp
[8] E. M. McCreight, "A space-economical suffix tree construction algorithm," Journal of the ACM, 1976, Vol. 23, pp
[9] E. Ukkonen, "On-line construction of suffix-trees," Algorithmica, 1995, Vol. 14, pp
[10] R. Giegerich, S. Kurtz, and J. Stoye, "Efficient implementation of lazy suffix trees," Soft. Pract. Exp., 2003,
[11] K.-B. Schurmann and J. Stoye, "Suffix-tree construction and storage with limited main memory," Technical Report, 2003, University of Bielefeld, Germany


More information

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology Introduction Chapter 4 Trees for large input, even linear access time may be prohibitive we need data structures that exhibit average running times closer to O(log N) binary search tree 2 Terminology recursive

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:

More information

Report Seminar Algorithm Engineering

Report Seminar Algorithm Engineering Report Seminar Algorithm Engineering G. S. Brodal, R. Fagerberg, K. Vinther: Engineering a Cache-Oblivious Sorting Algorithm Iftikhar Ahmad Chair of Algorithm and Complexity Department of Computer Science

More information

A Survey on Disk-based Genome. Sequence Indexing

A Survey on Disk-based Genome. Sequence Indexing Contemporary Engineering Sciences, Vol. 7, 2014, no. 15, 743-748 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.4684 A Survey on Disk-based Genome Sequence Indexing Woong-Kee Loh Department

More information

Fast Hybrid String Matching Algorithms

Fast Hybrid String Matching Algorithms Fast Hybrid String Matching Algorithms Jamuna Bhandari 1 and Anil Kumar 2 1 Dept. of CSE, Manipal University Jaipur, INDIA 2 Dept of CSE, Manipal University Jaipur, INDIA ABSTRACT Various Hybrid algorithms

More information

An Optimal Algorithm for the Euclidean Bottleneck Full Steiner Tree Problem

An Optimal Algorithm for the Euclidean Bottleneck Full Steiner Tree Problem An Optimal Algorithm for the Euclidean Bottleneck Full Steiner Tree Problem Ahmad Biniaz Anil Maheshwari Michiel Smid September 30, 2013 Abstract Let P and S be two disjoint sets of n and m points in the

More information

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18 istributed and Paged Suffix Trees for Large Genetic atabases Raphaël Clifford and Marek Sergot raphael@clifford.net, m.sergot@ic.ac.uk Imperial College London, UK istributed and Paged Suffix Trees for

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

COMP4128 Programming Challenges

COMP4128 Programming Challenges Multi- COMP4128 Programming Challenges School of Computer Science and Engineering UNSW Australia Table of Contents 2 Multi- 1 2 Multi- 3 3 Multi- Given two strings, a text T and a pattern P, find the first

More information

Full-Text Search on Data with Access Control

Full-Text Search on Data with Access Control Full-Text Search on Data with Access Control Ahmad Zaky School of Electrical Engineering and Informatics Institut Teknologi Bandung Bandung, Indonesia 13512076@std.stei.itb.ac.id Rinaldi Munir, S.T., M.T.

More information

Fast Exact String Matching Algorithms

Fast Exact String Matching Algorithms Fast Exact String Matching Algorithms Thierry Lecroq Thierry.Lecroq@univ-rouen.fr Laboratoire d Informatique, Traitement de l Information, Systèmes. Part of this work has been done with Maxime Crochemore

More information

Enhanced Suffix Trees. for Very Large DNA Sequences

Enhanced Suffix Trees. for Very Large DNA Sequences Enhanced Suffix Trees for Very Large DNA Sequences Si Ai Fan A Thesis In the Department of Computer Science and Software Engineering Presented in Partial Fulfillment of the Requirements for the Degree

More information

String matching algorithms تقديم الطالب: سليمان ضاهر اشراف المدرس: علي جنيدي

String matching algorithms تقديم الطالب: سليمان ضاهر اشراف المدرس: علي جنيدي String matching algorithms تقديم الطالب: سليمان ضاهر اشراف المدرس: علي جنيدي للعام الدراسي: 2017/2016 The Introduction The introduction to information theory is quite simple. The invention of writing occurred

More information

On the Suitability of Suffix Arrays for Lempel-Ziv Data Compression

On the Suitability of Suffix Arrays for Lempel-Ziv Data Compression On the Suitability of Suffix Arrays for Lempel-Ziv Data Compression Artur J. Ferreira 1,3 Arlindo L. Oliveira 2,4 Mário A. T. Figueiredo 3,4 1 Instituto Superior de Engenharia de Lisboa, Lisboa, PORTUGAL

More information

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya

More information

A New String Matching Algorithm Based on Logical Indexing

A New String Matching Algorithm Based on Logical Indexing The 5th International Conference on Electrical Engineering and Informatics 2015 August 10-11, 2015, Bali, Indonesia A New String Matching Algorithm Based on Logical Indexing Daniar Heri Kurniawan Department

More information

DATA STRUCTURES/UNIT 3

DATA STRUCTURES/UNIT 3 UNIT III SORTING AND SEARCHING 9 General Background Exchange sorts Selection and Tree Sorting Insertion Sorts Merge and Radix Sorts Basic Search Techniques Tree Searching General Search Trees- Hashing.

More information

Table of Contents. Chapter 1: Introduction to Data Structures... 1

Table of Contents. Chapter 1: Introduction to Data Structures... 1 Table of Contents Chapter 1: Introduction to Data Structures... 1 1.1 Data Types in C++... 2 Integer Types... 2 Character Types... 3 Floating-point Types... 3 Variables Names... 4 1.2 Arrays... 4 Extraction

More information

Figure 1. The Suffix Trie Representing "BANANAS".

Figure 1. The Suffix Trie Representing BANANAS. The problem Fast String Searching With Suffix Trees: Tutorial by Mark Nelson http://marknelson.us/1996/08/01/suffix-trees/ Matching string sequences is a problem that computer programmers face on a regular

More information

A Prototype for Multiple Whole Genome Alignment

A Prototype for Multiple Whole Genome Alignment A Prototype for Multiple Whole Genome Alignment Jitender S. Deogun, Fangrui Ma, Jingyi Yang Department of Computer Science and Engineering University of Nebraska Lincoln Lincoln, NE 6888-0, USA Andrew

More information

Department of Computer Science and Technology

Department of Computer Science and Technology UNIT : Stack & Queue Short Questions 1 1 1 1 1 1 1 1 20) 2 What is the difference between Data and Information? Define Data, Information, and Data Structure. List the primitive data structure. List the

More information

CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions

CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions Note: Solutions to problems 4, 5, and 6 are due to Marius Nicolae. 1. Consider the following algorithm: for i := 1 to α n log e n

More information

Introduction. for large input, even access time may be prohibitive we need data structures that exhibit times closer to O(log N) binary search tree

Introduction. for large input, even access time may be prohibitive we need data structures that exhibit times closer to O(log N) binary search tree Chapter 4 Trees 2 Introduction for large input, even access time may be prohibitive we need data structures that exhibit running times closer to O(log N) binary search tree 3 Terminology recursive definition

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Document Compression and Ciphering Using Pattern Matching Technique

Document Compression and Ciphering Using Pattern Matching Technique Document Compression and Ciphering Using Pattern Matching Technique Sawood Alam Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India, ibnesayeed@gmail.com Abstract This paper describes

More information

Lossless Compression Algorithms

Lossless Compression Algorithms Multimedia Data Compression Part I Chapter 7 Lossless Compression Algorithms 1 Chapter 7 Lossless Compression Algorithms 1. Introduction 2. Basics of Information Theory 3. Lossless Compression Algorithms

More information

splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014

splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014 splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis Objective Input! Output! A B C D Several

More information

A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms

A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms Charalampos S. Kouzinopoulos and Konstantinos G. Margaritis Parallel and Distributed Processing Laboratory Department

More information

Indexing Methods. Lecture 9. Storage Requirements of Databases

Indexing Methods. Lecture 9. Storage Requirements of Databases Indexing Methods Lecture 9 Storage Requirements of Databases Need data to be stored permanently or persistently for long periods of time Usually too big to fit in main memory Low cost of storage per unit

More information

( ). Which of ( ) ( ) " #& ( ) " # g( n) ( ) " # f ( n) Test 1

( ). Which of ( ) ( )  #& ( )  # g( n) ( )  # f ( n) Test 1 CSE 0 Name Test Summer 006 Last Digits of Student ID # Multiple Choice. Write your answer to the LEFT of each problem. points each. The time to multiply two n x n matrices is: A. "( n) B. "( nlogn) # C.

More information

Growth of the Internet Network capacity: A scarce resource Good Service

Growth of the Internet Network capacity: A scarce resource Good Service IP Route Lookups 1 Introduction Growth of the Internet Network capacity: A scarce resource Good Service Large-bandwidth links -> Readily handled (Fiber optic links) High router data throughput -> Readily

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Indexing Genomic Sequences on the IBM Blue Gene

Indexing Genomic Sequences on the IBM Blue Gene Indexing Genomic Sequences on the IBM Blue Gene Amol Ghoting IBM T. J. Watson Research Center Yorktown Heights, NY 10598, USA aghoting@us.ibm.com Konstantin Makarychev IBM T. J. Watson Research Center

More information

11/5/13 Comp 555 Fall

11/5/13 Comp 555 Fall 11/5/13 Comp 555 Fall 2013 1 Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Phenotypes arise from copy-number variations Genomic rearrangements are often associated with repeats Trace

More information

An Index Based Sequential Multiple Pattern Matching Algorithm Using Least Count

An Index Based Sequential Multiple Pattern Matching Algorithm Using Least Count 2011 International Conference on Life Science and Technology IPCBEE vol.3 (2011) (2011) IACSIT Press, Singapore An Index Based Sequential Multiple Pattern Matching Algorithm Using Least Count Raju Bhukya

More information

CMPUT 403: Strings. Zachary Friggstad. March 11, 2016

CMPUT 403: Strings. Zachary Friggstad. March 11, 2016 CMPUT 403: Strings Zachary Friggstad March 11, 2016 Outline Tries Suffix Arrays Knuth-Morris-Pratt Pattern Matching Tries Given a dictionary D of strings and a query string s, determine if s is in D. Using

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 5: Suffix trees and their applications Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Benchmarking a B-tree compression method

Benchmarking a B-tree compression method Benchmarking a B-tree compression method Filip Křižka, Michal Krátký, and Radim Bača Department of Computer Science, Technical University of Ostrava, Czech Republic {filip.krizka,michal.kratky,radim.baca}@vsb.cz

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

CS301 - Data Structures Glossary By

CS301 - Data Structures Glossary By CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm

More information

Text Compression through Huffman Coding. Terminology

Text Compression through Huffman Coding. Terminology Text Compression through Huffman Coding Huffman codes represent a very effective technique for compressing data; they usually produce savings between 20% 90% Preliminary example We are given a 100,000-character

More information

Module 4: Index Structures Lecture 13: Index structure. The Lecture Contains: Index structure. Binary search tree (BST) B-tree. B+-tree.

Module 4: Index Structures Lecture 13: Index structure. The Lecture Contains: Index structure. Binary search tree (BST) B-tree. B+-tree. The Lecture Contains: Index structure Binary search tree (BST) B-tree B+-tree Order file:///c /Documents%20and%20Settings/iitkrana1/My%20Documents/Google%20Talk%20Received%20Files/ist_data/lecture13/13_1.htm[6/14/2012

More information

Algorithms and Data Structures: Efficient and Cache-Oblivious

Algorithms and Data Structures: Efficient and Cache-Oblivious 7 Ritika Angrish and Dr. Deepak Garg Algorithms and Data Structures: Efficient and Cache-Oblivious Ritika Angrish* and Dr. Deepak Garg Department of Computer Science and Engineering, Thapar University,

More information

Efficient Stream Reduction on the GPU

Efficient Stream Reduction on the GPU Efficient Stream Reduction on the GPU David Roger Grenoble University Email: droger@inrialpes.fr Ulf Assarsson Chalmers University of Technology Email: uffe@chalmers.se Nicolas Holzschuch Cornell University

More information

Fast Substring Matching

Fast Substring Matching Fast Substring Matching Andreas Klein 1 2 3 4 5 6 7 8 9 10 Abstract The substring matching problem occurs in several applications. Two of the well-known solutions are the Knuth-Morris-Pratt algorithm (which

More information

Suffix Trees on Words

Suffix Trees on Words Suffix Trees on Words Arne Andersson N. Jesper Larsson Kurt Swanson Dept. of Computer Science, Lund University, Box 118, S-221 00 LUND, Sweden {arne,jesper,kurt}@dna.lth.se Abstract We discuss an intrinsic

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 7, July ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 7, July ISSN International Journal of Scientific & Engineering Research, Volume 4, Issue 7, July-201 971 Comparative Performance Analysis Of Sorting Algorithms Abhinav Yadav, Dr. Sanjeev Bansal Abstract Sorting Algorithms

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

An Introduction to Trees

An Introduction to Trees An Introduction to Trees Alice E. Fischer Spring 2017 Alice E. Fischer An Introduction to Trees... 1/34 Spring 2017 1 / 34 Outline 1 Trees the Abstraction Definitions 2 Expression Trees 3 Binary Search

More information

1 Introduciton. 2 Tries /651: Algorithms CMU, Spring Lecturer: Danny Sleator

1 Introduciton. 2 Tries /651: Algorithms CMU, Spring Lecturer: Danny Sleator 15-451/651: Algorithms CMU, Spring 2015 Lecture #25: Suffix Trees April 22, 2015 (Earth Day) Lecturer: Danny Sleator Outline: Suffix Trees definition properties (i.e. O(n) space) applications Suffix Arrays

More information

Algorithms and Data Structures Lesson 3

Algorithms and Data Structures Lesson 3 Algorithms and Data Structures Lesson 3 Michael Schwarzkopf https://www.uni weimar.de/de/medien/professuren/medieninformatik/grafische datenverarbeitung Bauhaus University Weimar May 30, 2018 Overview...of

More information