A Suffix Tree Construction Algorithm for DNA Sequences


Hongwei Huo
School of Computer Science and Technology, Xidian University, Xi'an 710071, China

Vojislav Stojkovic
Computer Science Department, Morgan State University, Baltimore, MD 21251, USA

Abstract

The suffix tree is a powerful data structure in string processing and DNA sequence comparison. However, suffix tree construction is very greedy in space, which is a fatal drawback. In addition, the performance of suffix tree construction using suffix links degrades rapidly as the scale of the sequences to be handled increases, because of random memory access. To overcome these disadvantages, a new bit layout with lower space requirements is used for the nodes of a suffix tree. Based on this layout, an algorithm that constructs suffix trees for DNA sequences using partitioning strategies is proposed. The effectiveness of the proposed algorithm is shown on test cases from the NCBI web site. Experiments compare it with Kurtz's algorithm in space requirements and running time. The results show that the proposed algorithm is memory-efficient and outperforms Kurtz's algorithm in average running time.

1. Introduction

The suffix tree is one of the most fundamental and important data structures for processing DNA sequences in large amounts of genetic and biochemical data. Suffix trees provide efficient access to all substrings of a string, and they can be constructed and represented in linear time and space. A suffix tree exposes the internal structure of a string in a deeper way. Suffix trees can be used to solve the exact matching problem in linear time, with the same worst-case bound as KMP[1] and Boyer-Moore[2], but they are more practical, as suffix trees for large texts, e.g. complete genomes with 3×10^9 base pairs, have been shown to be manageable[3].
Suffix trees can also handle substring problems with O(m) preprocessing time and O(n) search time for an input sequence of length m and a pattern of length n. Neither KMP nor Boyer-Moore can achieve this bound. Suffix trees can be used not only for substring problems but also for complex repeat-finding problems. For example, MUMmer[4, 5] is a genome alignment system that uses suffix trees as its main structure to align two closely related genomes. Owing to the advantages of suffix trees, MUMmer provides a faster, simpler, and more systematic way to solve hard problems.

Despite these features, suffix trees are not widely used in actual string processing software, because the space consumption of a suffix tree is still quite large even though it is asymptotically linear[9]. As a consequence, several alternative index structures have been developed that store less information than suffix trees and are more space efficient[6]: the suffix array, the level-compressed trie, the suffix binary search tree, and the suffix cactus[8]. These index structures have to be tailored to particular string matching problems and cannot be adapted to other kinds of problems without loss of performance. Moreover, traditional string methods cannot be applied directly to DNA sequences, which are too complex for them to handle. Thus reducing the space requirement of suffix trees remains an important problem in genome processing.

The suffix tree was proposed by Weiner[7], and many improvements have been made over the decades since. Early work on suffix tree construction focused on algorithms that run in linear space. These algorithms suit small inputs, for which the whole tree can be built in memory. However, they are less space efficient in practice, because they suffer from poor locality of memory reference on cached processor architectures, which also makes the tree difficult to store in secondary memory.
Once the data are too large to be loaded into memory, construction incurs many cache misses and much disk swapping. Thus, developing a practical algorithm for suffix tree construction is still an important problem.

To overcome these disadvantages, a new bit layout with lower space requirements is used for the nodes of a suffix tree. Based on this layout, an algorithm that constructs suffix trees for DNA sequences is proposed; it partitions the suffixes by common prefix so that independent subtrees can be built. The experiments show that the proposed algorithm is memory-efficient and performs better in average running time.

/07/$ IEEE 1178

2. Preliminary

A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node other than the root has at least two children, and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that, for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i spells out exactly the suffix of S that starts at position i.

Suffix trees can be constructed in linear time and space by several algorithms[6-9]. Some of these algorithms achieve O(n) construction time with the help of suffix links; a suffix link is a link from one internal node to another. Fig. 1 is an example of a suffix tree for the string ATTAGTACA, where the dashed line is a suffix link.

Fig. 1 The suffix tree of the string 'ATTAGTACA$'

3. Approach

3.1 Analysis

Modern processors use one or more caches to speed up memory access, which pays off when the accesses have good temporal or spatial locality. Because suffix links run across the whole suffix tree, linear-time construction algorithms such as Ukkonen's[9] and McCreight's[8] require many random memory accesses. In Ukkonen's algorithm, cache misses happen when the algorithm follows a suffix link to reach another subtree and check its child nodes. Such a traversal causes random accesses to very distant locations in memory. Each access is also more likely to miss, because the span of addresses is too large to be held in fast memory.

Kurtz's algorithm optimizes the space requirements of McCreight's algorithm. It divides the internal nodes into large nodes and small nodes, based on the relation between their head position values, to store the suffix tree information. During the construction of internal nodes there are many small-large chains, short or long, each of which is a sequence of small nodes followed by one large node. In a small-large chain, the values of headposition, depth and suffixlink of all small nodes can be derived from the large node at the end of the chain. Therefore, with the bit optimization technique, Kurtz's algorithm uses four integers for each large node, two integers for each small node and one integer for each leaf node. The longer the small-large chain is, the more space is saved.

Our analysis shows that a small-large chain is formed only if all of the nodes in the chain are new nodes created consecutively while a series of suffixes is added to the suffix tree one by one. However, a DNA sequence is not only well known for its repetitive structure but is also a sequence over a small alphabet, which makes repetition highly likely. Therefore, Kurtz's method applied to DNA sequences may not benefit from small nodes and instead produces more large nodes.

3.2 Algorithm

From Section 3.1 we can conclude that if the record that keeps the internal node information can be reduced to just three integers, some memory space can be saved. Furthermore, suffix-link based algorithms are not suitable when the input data is very large, so discarding the suffix link may be an ideal choice. Thus we use a three-integer bit layout for each internal node record[3].

By the properties of a suffix tree, if the suffixes that belong under one branching node of the root are grouped together in advance, their common prefixes can be merged step by step during top-down construction: each common prefix becomes the edge-label of an internal branching node, and the corresponding leaf nodes are generated, until all branching nodes under that branch are built. Combining this with partitioning, we propose a new suffix tree construction algorithm, ST-PTD (Suffix Tree Partition and Top-Down).
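The space argument above can be checked with a small calculation. The following is a minimal sketch that only counts integers as stated in the text (four per large node, two per small node, one per leaf for Kurtz's layout; three per internal node for the three-integer layout, assuming one integer per leaf there as well). The function names and the node counts are hypothetical and are not taken from the experiments.

```python
def kurtz_integers(large, small, leaves):
    """Integers used when large nodes cost 4, small nodes 2, and leaves 1."""
    return 4 * large + 2 * small + leaves


def three_integer_layout(internal, leaves):
    """Integers used when every internal node costs 3 and every leaf 1 (assumed)."""
    return 3 * internal + leaves


def compare(large, small, leaves):
    """Return (Kurtz-style count, three-integer count) for one node mix."""
    return (kurtz_integers(large, small, leaves),
            three_integer_layout(large + small, leaves))


if __name__ == "__main__":
    leaves = 1000  # hypothetical sequence length n
    # Hypothetical splits of 660 internal nodes into large and small nodes.
    for large, small in [(500, 160), (330, 330), (160, 500)]:
        k, t = compare(large, small, leaves)
        print(f"large={large} small={small}: Kurtz-style={k}, three-integer={t}")
```

Under these counts the two layouts break even when large and small nodes are equally frequent (4L + 2S = 3L + 3S exactly when L = S), so the three-integer record uses fewer integers whenever large nodes outnumber small ones, which is the situation the analysis expects for DNA sequences.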
Because of the partitioning, larger inputs can be handled, and each subtree is constructed in memory independently. Fig. 2 shows the algorithm ST-PTD. It uses four data structures for the construction of the suffix tree: an array String that stores the input string, an array Suffixes that stores the partitions, a temporary working space Temp for counting sort, and the suffix tree Tree.

Algorithm ST-PTD (String, prefixlen)
Phase 1: Preprocessing
    Scan String and partition Suffixes based on the first prefixlen symbols of each suffix
Phase 2: Do for each partition
    Construct suffix tree
    for each partition Pi do
        R ← Pi
        do
            S ← counting-sort(R, Temp)
            if |S| = 1 then
                create a leaf l
                Tree ← Tree ∪ {l}
            else
                for each R ∈ S do
                    if |R| = 1 then
                        create node n and leaf l
                        Tree ← Tree ∪ {l, n}
                    else
                        Push(R)
            if Stack is not empty then
                R ← Pop
                l ← finding-lcp(R)
                for each suffix-index ∈ R do
                    suffix-index ← suffix-index − l
        while Stack is not empty
    Merge

The ST-PTD algorithm consists of two phases: partitioning and subtree construction.

We divide the suffixes of the input string into |A|^prefixlen parts, where |A| is the alphabet size of the string and prefixlen is the depth of partitioning. The partition procedure is as follows. First we scan the input string from left to right. At each index position i, the prefixlen subsequent characters determine one of the |A|^prefixlen partitions, and the index i is recorded in that partition's buffer. At the end of the scan, each partition contains the suffix pointers of the suffixes that share the same prefix of length prefixlen. For DNA sequences, assuming that the internal nodes close to the root are dense because the sequences are highly repetitive and the alphabet is small, we can set prefixlen to log_4(seq_length) − 1. However, when prefixlen is larger than 7, the partition phase becomes costly for large datasets such as genomes and brings no obvious advantage to the algorithm, so we take the value of prefixlen to be (log_4(seq_length) − 1)/

3.3 Time and space complexity

The execution time of the ST-PTD algorithm is O(n^2) in the worst case. The suffix tree can be represented with n + 3a integers for a sequence of length n, where a is the number of internal nodes. Thus, on a 32-bit computer, each character requires (4n + 12a)/n bytes on average. The ratio a/n is about 0.66 for the DNA sequences used in the experiments, so each character in the sequence requires about 4 + 12 × 0.66 ≈ 11.92 bytes on average.

4. Experimental results and analysis

4.1 Space requirements

We use DNA sequences from the NCBI web site to compare the space requirement of ST-PTD with that of Kurtz's algorithm[6]. The numbers given in the table refer only to the space required for construction and do not include the n bytes used to store the input string.

Table 1 The space requirement of Kurtz's algorithm and ST-PTD
Columns: Sequence, Length, Kurtz's algorithm, ST-PTD
Rows: AC, AC, BC, J, M, M, M, V, X, ecoli, [Average]

Table 1 shows the space requirements for each sequence. The space requirement is defined as the average number of bytes used per character. The first column gives the names of the DNA sequences and the second their lengths. The third and fourth give the space requirements of Kurtz's algorithm and ST-PTD, respectively. Compared with Kurtz's method, ST-PTD saves space. There is no relationship between the space needed and the length of the sequence; however, the sequence structure, as in J03071, has a great effect on the space demand.

4.2 Running time

Both algorithms, Kurtz's method and ST-PTD, were implemented for the experiments. The programs were written in C and compiled with GCC. To demonstrate the impact of memory on the algorithms, the programs were run on two different platforms, which we call Config 1 and Config 2. Config 1 is an Intel Pentium 4 3 GHz with 512M RAM running Red Hat Linux 9; Config 2 is an Intel Pentium III 1.3 GHz with 128M RAM running Fedora 4. The experimental results are shown in Table 2. The running time is in seconds, and the throughput is the ratio of the time multiplied by 10^6 to the sequence length. The dark shaded areas show the better throughput. '-' indicates a running time of more than 1 hour.
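To make the throughput measure concrete, here is a minimal sketch of the definition given above. The function name and the sample values are hypothetical and are not taken from Table 2.

```python
def throughput(seconds, length):
    """Running time in seconds multiplied by 10^6, divided by the sequence length."""
    return seconds * 1_000_000 / length


if __name__ == "__main__":
    # Hypothetical run: 3 seconds on a sequence of 1.5 million characters.
    print(throughput(3.0, 1_500_000))
```

Read this way, the value is the number of microseconds spent per character, so a smaller throughput value means a faster construction.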

Table 2 The running time and throughput of Kurtz's algorithm and ST-PTD
Columns: Sequence, Length, then time and tput for Kurtz's algorithm and for ST-PTD under Config 1 and under Config 2
Rows: J, V, AC, M, M, AC, X, B_anthracis_Mslice, H. sapiens chr. 10 slices, ecoli, influenza slice, Arabidopsis thaliana chr. 4, [Average]

The main data structures used in both algorithms are arrays, because they are more efficient in time, although they also limit the size of the data that can be handled. We nevertheless use arrays, because an implementation of Kurtz's algorithm based on linked lists takes about 20 minutes for the sequence B_anthracis_Mslice of length 317k and over four hours for the sequence ecoli of length 4.6M.

The table supports several observations. Although ST-PTD runs in O(n^2) time and Kurtz's algorithm in O(n) time in the worst case, ST-PTD is a little faster than Kurtz's algorithm in average running time. This shows that locality of memory reference has a great influence on the running time of the algorithms. The partitioning strategy and the sequence structure also affect performance; for example, the difference caused by unbalanced partitions on the sequence influenza slice is obvious. ST-PTD has a greater advantage over Kurtz's algorithm on the lower configuration because of its partition phase, which reduces the size of the subproblems being processed so that larger inputs can be handled.
Comparing the running times of the two algorithms under the different configurations, we can see that memory is still one of the bottlenecks affecting performance, because the suffix tree is indeed very greedy for space. In addition, compared with Kurtz's algorithm, ST-PTD is easier to understand and implement. ST-PTD is also easier to parallelize, because the construction of each subtree is independent.
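The independence of the per-partition construction can be illustrated with a short sketch. It is not the authors' implementation: it groups suffixes with a dictionary instead of counting sort, uses recursion instead of an explicit stack, stores nodes as nested dictionaries instead of the three-integer layout, uses 0-based positions, and omits the final Merge step by returning one subtree per partition. All function names are hypothetical, and the input is assumed to end with a unique '$' terminator.

```python
from collections import defaultdict


def common_prefix_len(text, starts, depth):
    """Length of the prefix shared by all suffixes in `starts`, counted from offset `depth`."""
    first = starts[0]
    length = 0
    # The unique '$' terminator guarantees a mismatch before any index runs off the end.
    while all(text[s + depth + length] == text[first + depth + length] for s in starts):
        length += 1
    return length


def build_subtree(text, starts, depth=0):
    """Top-down construction for suffixes that agree on their first `depth` symbols.

    Returns {edge_label: child}; a child is a nested dict (internal node)
    or an int (leaf holding the 0-based suffix start).
    """
    groups = defaultdict(list)
    for s in starts:
        groups[text[s + depth]].append(s)  # split on the next symbol
    node = {}
    for symbol in sorted(groups):
        group = groups[symbol]
        if len(group) == 1:
            s = group[0]
            node[text[s + depth:]] = s  # the rest of the suffix labels the leaf edge
        else:
            lcp = common_prefix_len(text, group, depth)
            label = text[group[0] + depth:group[0] + depth + lcp]
            node[label] = build_subtree(text, group, depth + lcp)
    return node


def build_forest(text, prefixlen):
    """Partition suffixes by their first `prefixlen` symbols, then build each part on its own."""
    if not text.endswith("$") or text.count("$") != 1:
        raise ValueError("text must end with a single '$' terminator")
    partitions = defaultdict(list)
    for i in range(len(text)):
        partitions[text[i:i + prefixlen]].append(i)
    # Each call below reads only its own partition, so the calls are independent.
    return {prefix: build_subtree(text, starts) for prefix, starts in partitions.items()}


def leaves(node, path=""):
    """Yield (suffix_start, spelled_path) for every leaf under `node`."""
    for label, child in node.items():
        if isinstance(child, dict):
            yield from leaves(child, path + label)
        else:
            yield child, path + label


if __name__ == "__main__":
    text = "ATTAGTACA$"
    forest = build_forest(text, 2)
    print(forest["TA"])
    for subtree in forest.values():
        for start, spelled in leaves(subtree):
            assert spelled == text[start:]  # every path spells its own suffix
```

Because `build_subtree` receives only the start positions of one partition and the read-only input string, the per-partition calls could in principle be handed to separate workers; combining subtrees whose prefixes share leading symbols would still be needed to obtain a single tree.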

5. Conclusions

A new suffix tree construction algorithm is presented in this paper. Instead of using small-large chains and suffix links in the construction of the suffix tree, we use a new bit layout. Based on this layout, we develop an algorithm that constructs suffix trees for DNA sequences using a partitioning strategy based on common prefixes, which allows independent subtrees to be built in memory. ST-PTD is cache-efficient despite its O(n^2) worst-case complexity, and the experiments show that the proposed method performs better in average running time.

References

[1] D. E. Knuth, J. H. Morris, and V. R. Pratt, "Fast pattern matching in strings," SIAM Journal on Computing, 1977, Vol. 6.
[2] R. S. Boyer and J. S. Moore, "A fast string searching algorithm," Communications of the ACM, 1977, Vol. 20.
[3] Yun-Ching Chen and Suh-Yin Lee, "Parsimony-spaced suffix trees for DNA sequences," ISMSE'03, Nov.
[4] Arthur L. Delcher, Simon Kasif, Robert D. Fleischmann, Jeremy Peterson, Owen White and Steven L. Salzberg, "Alignment of whole genomes," Nucleic Acids Research, 1999, Vol. 27.
[5] Arthur L. Delcher, Adam Phillippy, Jane Carlton and Steven L. Salzberg, "Fast algorithms for large-scale genome alignment and comparison," Nucleic Acids Research, 2002, Vol. 30.
[6] S. Kurtz, "Reducing the space requirement of suffix trees," Software Pract. Experience, 1999, Vol. 29.
[7] P. Weiner, "Linear pattern matching algorithms," Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, 1973.
[8] E. M. McCreight, "A space-economical suffix tree construction algorithm," Journal of the ACM, 1976, Vol. 23.
[9] E. Ukkonen, "On-line construction of suffix-trees," Algorithmica, 1995, Vol. 14.
[10] R. Giegerich, S. Kurtz, J. Stoye, "Efficient implementation of lazy suffix trees," Soft. Pract. Exp., 2003.
[11] K.-B. Schurmann, J. Stoye, "Suffix-tree construction and storage with limited main memory," Technical Report, 2003, University of Bielefeld, Germany.
