RNA Secondary Structure Prediction by Stochastic Context-Free Grammars


Faculty of Applied Sciences
Department of Electronics and Information Systems
Head of the Department: Prof. Dr. Eng. J. Van Campenhout

RNA Secondary Structure Prediction by Stochastic Context-Free Grammars
by Steven Van Vaerenbergh

Coordinator: Prof. Dr. Eng. J.-P. Martens, Ghent University
Instructor: Associate Prof. L. Vielva, University of Cantabria, Spain

DISSERTATION SUBMITTED IN ORDER TO OBTAIN THE ACADEMIC DEGREE OF ELECTRICAL ENGINEER
Academic year


Summary

The function that most types of RNA molecules perform is determined by their structure, which in turn is determined by the linear RNA sequence. Predicting the secondary structure of an RNA molecule from its linear base sequence is a challenge in bioinformatics, with applications in medical sciences, biology and the study of phylogenetic history. In this study, the known methods of RNA secondary structure prediction are reviewed first. Then, a number of algorithms and programs are developed as tools to apply to the problem. An existing algorithm for finding substrings in large strings using suffix trees is extended to an algorithm that lists repetitions and biological palindromes in DNA or RNA sequences, and this is programmed in ANSI C. A basic hidden Markov model is programmed in Matlab, and then extended to the more general model of stochastic context-free grammars. Algorithms for this model are implemented in Chomsky normal form. Next, the stochastic context-free grammars are described specifically for RNA modelling. At the end of the project, an attempt is made to develop a new prediction approach. Two probabilistic models are constructed, taking into account RNA molecule features as they are treated in the existing thermodynamic approach of the Zuker algorithm. Extensions of the previous probabilistic algorithms are programmed for these two specific cases, the complete models are trained with sequences from RNA databases, and their prediction accuracy is tested on unknown sequences. The results point to possible model improvements, and a list of refinements is given at the end of this report.

Keywords: RNA, secondary structure, hidden Markov model, stochastic context-free grammar, suffix tree, bioinformatics.

The author gives his permission to make this dissertation available for consultation and to copy parts of it for personal use. Any other use is subject to the restrictions of copyright, in particular the obligation to cite the source explicitly when reporting results from this dissertation.

Steven Van Vaerenbergh
June

Acknowledgements

First of all, I wish to thank my project instructor Luis Vielva, for his enthusiastic support and his suggestions for different approaches to the problems encountered, and my coordinator Jean-Pierre Martens, for supervising this final year project. Helpful advice on some biological issues in this study was provided by Fernando de la Cruz, professor of Molecular Biology at the University of Cantabria. I also wish to thank my parents, for supporting me and giving me the opportunity to finish my studies, and my girlfriend Angela, for cheering me up when the experimental results of this investigation did the opposite.

"It must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any interpretation of this term."
Noam Chomsky, famous linguist, on the probabilistic approach to handling grammars

"Every time I fire a linguist, my system's performance improves."
Fred Jelinek, former head of the IBM speech recognition group, on the statistical language recognizer¹

¹ Credit for pairing these quotes goes to D. Jurafsky and J. Martin [8].

Contents

1 Introduction
2 Molecular Genetics and the Sequencing Evolution: Unravelling the Genome; DNA; RNA; Proteins; Sequence Databases; Mapping and Sequencing the Human Genome; Database Search; Sequence Similarity; Multiple Alignments; Biological View
3 Sequence Applications of Suffix Trees: The Suffix Tree; Suffix Trie; Suffix Tree Definition; Mechanics; Basics of the Algorithm; Example: Active and End Point; Suffix Pointer; Source Code; Algorithm for Finding Repetitions; The Repetition List; Restrictions on Repetitions; Algorithm for Finding Complementary Reversals; Complementary Reversals (CRs); Algorithm: the Concatenation Approach; Program Results; Application to RNA; Suffix Trees for Very Large Sequences; Conclusions
4 Stochastic Context-Free Grammars: Hidden Markov Models; Example; Elements of an HMM; Definition of an HMM; Three Basic Problems and Solutions; Transformational Grammars; Linguistics; Definition; Example; Parsing; Chomsky Hierarchy of Transformational Grammars; Regular Grammars; Context-Free Grammars; Stochastic Grammars; Sequence Modelling with SCFGs; The Inside Algorithm; The Outside Algorithm; Parameter Re-estimation; The CYK Algorithm; Implementation of the Chomsky Normal Form SCFG; Sequence Generation by the Model Itself; Implementation; Application of the Model in Chomsky Normal Form
5 RNA Secondary Structure Prediction using SCFGs: Terminology; Simple RNA Secondary Structure Prediction; The Nussinov Algorithm for Base Pair Maximization; A First Nussinov-based SCFG Algorithm; Use of the Nussinov Algorithm; A General RNA SCFG Model
6 Obtaining the SCFG Probabilities: Model Choice; Nonterminals Choice; Algorithms Choice; Training Sequences: tRNA; The First Model; Model Suppositions; Parse Tree Suppositions; Implementation; Results; Interpretation; Model Estimation by Inside-Outside Training; The Second Model; Model Suppositions; Parse Tree Suppositions; Implementation; Results and Interpretation; Further Improvements; Conclusions
7 Conclusions: Project Overview; Achieved Goals; Future Research Guidelines
A Summary of Existing RNA Secondary Structure Prediction Methods: A.1 The Zuker Folding Algorithm: Energy Minimisation; A.1.1 The mfold Program; A.1.2 Suboptimal RNA Folding; A.2 Covariance Models: SCFG-based RNA Profiles; A.2.1 Performance; A.3 SCFGs for Homologous RNA Sequences using Tree Grammar EM; A.4 Attempts to Model Pseudoknots; A.5 More Approaches; A.6 Additional Data; A.6.1 Bulge Loop Distribution Counts
B Notes on the HMM Implementation
C Source Code Overview
D An Extract of the Results: D.1 First RNA SCFG Model

Chapter 1

Introduction

With the mapping of the human genome, an enormous amount of biological information has become available. To analyse it efficiently and to interpret the useful data, many manual methods are being replaced by engineering approaches, resulting in the science called bioinformatics. One of the challenges in bioinformatics is the prediction of RNA secondary structure, a complex problem whose results have applications in medical sciences, biology and the study of phylogenetic evolution.

At the start of this investigation, the goals were not completely specified, due to the experimental nature of this research field. They would depend on the progress and the intermediate results. On the other hand, it was clear from the beginning that developing a completely new and functioning algorithm for RNA secondary structure prediction is a task that far exceeds the time available in a single final year project. Therefore, the goals were restricted to a preparatory investigation. These goals are threefold:

- a study of the key areas (biology, probability theory, dynamic programming and formal language theory) and of the known methods for RNA secondary structure prediction,
- the development of algorithms and programs that can be used as tools to facilitate the development of a prediction program,
- if time allows, a draft version of a model that predicts simplified RNA structures.

A discussion of biosequence analysis requires background in several key areas. After this first chapter, which consists of the project description, an introduction to biology, the first key area, is given in chapter 2, along with the arguments why knowledge of RNA secondary structure is so important.

In chapter 3, an algorithm for listing repetitions and biological palindromes in DNA or RNA sequences is developed. It can be applied in, for instance, algorithms for multiple alignments of biological sequences, or structure prediction algorithms that start with a complete listing of all possible strand folding sites. The presented algorithm is based on the suffix tree structure, which allows efficient searches for string matches.

When DNA sequences are analysed, they are often described as being generated by a hidden Markov model. This is a probabilistic model that can be seen in the larger context of transformational grammars, and that can subsequently be extended to the model of a stochastic context-free grammar. The latter has fewer restrictions than the hidden Markov model and is appropriate for modelling the specific characteristics of RNA secondary structure. All the terms mentioned here are explained in chapter 4, along with a description of the accompanying basic problems and the algorithms that solve them.

The dynamic programming algorithm for general structure prediction from chapter 4 can be written in a specific form for RNA sequences, reducing its time and memory complexity. This algorithm is presented in chapter 5. It is then transformed into a probabilistic version.

In chapter 6, two probabilistic models for RNA secondary structure prediction are designed and implemented. They are based on an extension of the algorithm from the previous chapter, and use characteristics borrowed from a thermodynamic structure prediction approach, the Zuker algorithm. They are trained with sequences from tRNA databases, after which their prediction accuracy is tested on sequences with unknown structures.

The report concludes in chapter 7 with a list of improvements to the presented models and programs, and with possible approaches to continue the development of a complete RNA secondary structure prediction algorithm. The experimental nature of the research in this area was confirmed a few times by new methods that were tried and had to be abandoned later for lack of useful results. These methods are not included in the report.
Wherever possible, the explanations in this report are limited to the reasoning behind the programmed methods, their results and the conclusions. Apart from some algorithms in pseudocode, program source code is not included. It can be found on the accompanying CD-ROM, which contains additional information on the implementation. Some parts of the report, like the explanation of suffix trees, are extensively illustrated with diagrams. This is assumed to be helpful for understanding the long text descriptions (and was moreover requested by my instructor). Also, some parts covering existing theory are adopted from textbooks or articles, where they describe relatively unknown matters that are necessary for a good understanding of this investigation.

This dissertation was written in English since it is the result of a final year project carried out at the University of Cantabria in Spain, where I studied as a participant in the Erasmus exchange programme. A last introductory note is that the macro package LaTeX2e was chosen as a word processor, because of its predefined, professional layout, its capacity for typesetting mathematical formulae and the ease with which it structures documents [15].

Chapter 2

Molecular Genetics and the Sequencing Evolution

The complete set of instructions for making an organism is called its genome. It contains the master blueprint for all cellular structures and activities for the lifetime of the cell or organism. In the last few years, both technological innovation and the realization that genomic sequencing is fundamental to the study of life on earth have greatly encouraged efforts to obtain the entire genomic sequences of several organisms. Because of those new technologies, science can start to apply engineering methods to the study of biology. This chapter gives an overview of the biological background of the structures analysed in this work.

2.1 Unravelling the Genome

Found in each of an organism's many trillions of cells, the genome consists of tightly coiled threads of DNA and associated protein molecules, organized in structures called chromosomes (see fig. 2.1). The human genome contains 24 chromosomes, in which 3 billion DNA characters are organized.

2.1.1 DNA

Cells are the fundamental working units of every living system. All the instructions needed to direct their activities are contained within the DNA (deoxyribonucleic acid). DNA from all organisms is made up of the same chemical and physical components. It is a macromolecule of two strands wrapped around each other as a double helix (see fig. 2.1). Each of the strands contains repeating similar units, the nucleotides, each characterized by one of four different bases: adenine (A), cytosine (C), guanine (G) and thymine (T). Nucleotides are strung together by covalent bonds linking the 3' hydroxyl group of one nucleotide to the phosphate group at the 5' carbon of the sugar of the next nucleotide. Thus DNA strands form linear chains and have a direction specified by the 5' and 3' ends¹.

Figure 2.1: The different levels in the genome. On the highest level, the genome consists of a set of chromosomes. On the lowest level, there is the double DNA strand, in which each strand consists of a sequence of nucleotides, each characterized by a base. (Figure copyright US National Human Genome Research Institute)

The two strands are held together by weak bonds between the bases, where A pairs with T, and C pairs with G, forming the so-called Watson-Crick base pairs. Basically, this complementary pairing means that the two strands contain the same information.

Exactly fifty years ago, Watson and Crick showed that the base-pairing nature of DNA allows this genetic information to be carried accurately [27]. Two mechanisms can be explained by this knowledge. The first one is DNA replication: during cell division the two chains split and each serves as a template for the complementary bases, forming two new double-stranded DNA chains (one for each daughter cell). The second one is gene expression: one section of a chain, a gene, collects complementary bases and creates a single-chained linear molecule that codes for a protein. This is the RNA molecule.

2.1.2 RNA

In eucaryotic cells, DNA is located inside the nucleus, whereas protein synthesis occurs in the cytoplasm, outside the nucleus. Therefore, there must be another information-carrying molecule that can transfer the genetic information from the DNA inside the nucleus to the protein synthesis site in the cytoplasm. This is the function of the RNA (ribonucleic acid) molecule. It is chemically very similar to DNA. It also consists of a long chain of nucleotides, and the main differences with DNA are that RNA is single-stranded and that thymine is replaced with a similar base, uracil (U).

As said, this type of RNA codes for a protein. It is called messenger RNA (mRNA), and it moves to the ribosomes, where translation occurs by means of another RNA type, transfer RNA (tRNA). The tRNA treats the single chain of bases in the mRNA as divided into groups of three bases (codons), and for each triplet it selects the corresponding amino acid (out of the twenty existing amino acids). The amino acids are then assembled into molecules known as proteins. Other types of RNA also exist, like ribosomal RNA (rRNA).
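The two steps just described, collecting complementary bases and reading the result in triplets, can be sketched in a few lines of Python. This is only an illustration: the function names `transcribe` and `codons` and the example template are assumptions made here, not part of the programs of this project, and the template strand is assumed to be written in the 3' to 5' direction so that the complementary mRNA reads 5' to 3'.

```python
# RNA base that pairs with each base of the DNA template strand
# (uracil takes the place of thymine in RNA).
DNA_TO_RNA = {"A": "U", "C": "G", "G": "C", "T": "A"}


def transcribe(template):
    """Build the mRNA from the complementary bases of a DNA template.

    Assumption: `template` is written 3' to 5', so the returned
    mRNA string reads 5' to 3'.
    """
    return "".join(DNA_TO_RNA[base] for base in template)


def codons(mrna):
    """Split an mRNA string into consecutive triplets (codons).

    Trailing bases that do not fill a complete triplet are ignored.
    """
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    mrna = transcribe("TACGGCATT")
    print(mrna, codons(mrna))
```

Mapping each codon to its amino acid would require the genetic code table, which is deliberately left out of this sketch.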
Some RNAs are necessary, for example, in virus functioning.

RNA Secondary Structure

RNA is one of nature's most complex machines. It can carry information in the sequence of bases, and it can perform certain functions based on how it folds in space, which is dictated by the sequence of bases. Whereas the DNA strands are entwined as a double helix, the single RNA strand folds in space and interacts with itself, resulting in a two-dimensional structure known as the RNA secondary structure. Because of the chemical nature of the bases, base pairs can form (creating stems in the molecule). A base pair is weaker than the covalent bonds that string the sequence together, and therefore the ordering of the bases in the original sequence remains invariant regardless of the base pairs that form. The most common base pairs are Watson-Crick pairs (A pairs with U and C pairs with G), but in nature wobble pairs (G pairs with U) also occur.

The secondary structure of RNA is usually characterized by specifying which bases interact. When base pairs form, the sequence is divided into stems, which are the base-pairing regions, and loops, which are the non-base-pairing regions.

[Figure 2.2: structure diagram not reproduced.]
Figure 2.2: An example of RNA secondary structure: part of the small subunit ribosomal RNA molecule of Tetrahymena bergeri. Bars mark the base-paired stems.

Leaving the wobble pairs aside for a moment, the base sequences that form stems in the secondary structure are complementary in the Watson-Crick way, and run in opposite directions (see fig. 2.2). Such stretches constitute the so-called biological palindromes² in the sequence, or complementary reversals, as they are called in chapter 3.

Base pairs almost always occur in a nested fashion in RNA secondary structure. This means that if we draw arcs over an RNA sequence connecting the base pairs, the arcs can be drawn so that none of them cross. Formally, a base pair between positions i and j and a base pair between positions i' and j' are nested if and only if i < i' < j' < j or i' < i < j < j'. When non-nested base pairs occur, they are called pseudoknots (see fig. 2.3). A pseudoknot is formed when bases that are enclosed between two parts of a stem form a new stem with bases from another part of the sequence. Pseudoknotted RNA structures occur in virtually all classes of RNA and are involved in a number of important functions.

[Figure 2.3: panels (a) and (b) not reproduced.]
Figure 2.3: (a) A representation of a pseudoknot. A stem starting at the 5' end encloses bases that pair with other bases at the 3' end. Bars mark the Watson-Crick base pairs, the dot marks a wobble pair. (b) Characterization of the secondary structure by marking the normal stem with ( and ) (and * for the wobble pair), and the pseudoknot with [ and ]. Note that there is some ambiguity in this example: there is one normal stem and one pseudoknot, but one is free to choose which is which. In most cases, it is quite clear which one is the pseudoknot because it spans a lot of normal stems. This example is an RNA inhibitor of the human immunodeficiency virus reverse transcriptase (Tuerk, MacDougal & Gold 1992).

Functionally related RNAs often have the same secondary structure, while their sequence similarity has largely vanished throughout evolution. One possibility is that they have descended from a common ancestor. This makes knowledge of the secondary structure an important resource for evolutionary research. Moreover, RNA molecules that fulfil similar functions in different organisms tend to conserve their secondary structure rather than their linear sequence, which has been mutating through evolutionary history. These reasons of evolutionary history and relationship underline the great importance of knowing the secondary structure of an RNA molecule [5].

¹ Most DNA processing occurs in a linear manner, starting from the 5' end and proceeding to the 3' end.
² A palindrome is a word or sentence that reads the same forwards as backwards, like "Doc, note. I dissent. A fast never prevents a fatness. I diet on cod.", credited to Peter Hilton, a member of the British cryptography team that cracked the German Enigma code in World War II.

2.1.3 Proteins

A protein is a polypeptide chain, built from the twenty amino acids (each specified by a codon, a triplet of bases), that spontaneously folds into a well-defined three-dimensional structure. Proteins play many roles in an organism, and many different types exist. Proteins make up much of the structure of organisms, and they help in muscle movement, which is made possible by a type of protein that can contract. Proteins that speed up chemical reactions (without being destroyed) are called enzymes. Many proteins serve as messengers, either between different parts of the cell or between cells, or help to turn genes on or off depending on the cell's environment.
Proteins also take part in active transport, like pumping materials into and out of cells or between cells. Finally, they help in receiving information from the environment, like chemical or other signals a cell receives. This is done by receptor proteins, which enable the cell to recognize the information so that it can react to it.

Understanding how genes function will require analyses of the three-dimensional structures of the proteins for which the genes code (summarized in the scheme of fig. 2.4). Unfortunately, while the entire information for a protein's 3-D structure appears to reside in the primary sequence, attempts to predict the structure, and hence the function, from sequence alone have been unsuccessful. Since the medical and biological interest in this problem is enormous, protein structure prediction is a highly active area of investigation.

sequence → structure → function
Figure 2.4: The fundamental idea on which protein structure prediction research is based. The arrows stand for "determines".

2.2 Sequence Databases

2.2.1 Mapping and Sequencing the Human Genome

Public databases contain the complete nucleotide sequence of the human genome and those of selected model organisms. Four major databases store nucleotide sequences: GenBank (maintained by the US National Center for Biotechnology Information, NCBI) and the Genome Sequence DataBase (GSDB) in the United States, the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database in the United Kingdom, and the DNA Database of Japan (DDBJ). The databases collaborate to share sequences, which are compiled from direct author submissions and journal scans.

2.2.2 Database Search

The wealth of sequence data has made the use of fast and efficient search algorithms necessary. Database search involves finding sequences in a database that are by some measure similar to a model that represents what is being sought. Sometimes this model is a single sequence combined with a simplistic model of evolution, as is the case with the popular tool BLAST (Basic Local Alignment Search Tool). This is a set of similarity programs designed to explore all of the available sequence databases. The programs have been designed for speed, with a minimal sacrifice of sensitivity to distant relationships with the query sequence.
Another popular search method uses FASTA. FASTA is a collection of search programs, distributed by Dr. W. R. Pearson of the University of Virginia. These programs use the FASTA algorithm, which assigns scores to sequences depending on the number of identities with the query sequence. Other models can involve a set of sequences and a probabilistic model, or an abstracted model of physics and chemistry. Discrimination is at the basis of all these models: starting with the complete list of database entries, the programs exclude sequences according to their algorithms until only the sequences similar to the query are left. Probabilistic machine learning and Bayes decision theory address this issue of discrimination in a formal manner.

2.3 Sequence Similarity

2.3.1 Multiple Alignments

A multiple alignment is a tool used in biology to show the correspondence between a set of sequences [26]. This correspondence is shown by aligning in columns the portions of the sequences that are similar (see fig. 2.5). The multiple alignment is important because it allows sequences to be viewed structurally despite the mutations caused by evolution.

   ((((((( (((( )))) ((((( ))))) ((((( ))))))))))))
1. -AUUUAUAUAGUUUAAUA------AAAACAUUACAUUUUCAUUGUAAAA A-UAAAAUUUUUAU-AUUUUUAUAAAUU
2. AAGGAGUUAGUUAAA---AU---AUAACAUUAGAAUGUCAAUCUAAAA U-AACUA--AAAA---UAGUACACCUUG
3. GCGGGUAUAGUUUAGU--GGU--AAAACCUUAGCCUUCCAAGCUAACG A-UGCGGGUUCGAUUCCCGCUACCCGCU
4. UUCUUAAUAGCUUAGU--GGUU-AAAGCAUUCGGCUGUUAACCGAAAU A-CACUAGUUCAAUUCUAGUUUAAGAAG
5. AAAUCUAUAAUUUAAU--GGAU-AAAAUAAAAACCUUCUAAGUUUUAU A-UGUAAGUUCAAAUCUUACUAGAUUUA
6. GCUUGCUUAACUCAAUC-GGU--AGAGUAUCGGUUUUGUAAACCGAAG GUUAUCGGUUCAACUCCGAUAGCAAGCU
7. UGCGCGGUAGGAGAGU--GGA--ACUCCGACGGGCUCAUAACCCGUAG GUCCCAGGAUCGAAACCUGGCCGCGCAA---

Figure 2.5: A multiple alignment of seven tRNA sequences from the EMBL Data Library. Abbreviations are ( and ) for base-paired columns and - for deletions by skip productions.

2.3.2 Biological View

As stated earlier, the similarity reflected by the secondary structure groups together different objects that have descended from a common ancestor. This concept of a common ancestor is known as homology. Homologous molecules usually share a common function.
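To close this chapter, the bracket annotation used in figures 2.3 and 2.5 and the nesting condition of section 2.1 can be made concrete with a small Python sketch. It is illustrative only: the function names are hypothetical, positions are counted from 1, and any character other than ( and ) is assumed to mark an unpaired position. Two base pairs are reported as crossing when i < i' < j < j', which is the situation called a pseudoknot above.

```python
def parse_brackets(annotation):
    """Return the base pairs (i, j), 1-based, encoded by '(' and ')'.

    Every other character is treated as an unpaired position
    (an assumption of this sketch).
    """
    open_positions, pairs = [], []
    for pos, symbol in enumerate(annotation, start=1):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unmatched ')' at position %d" % pos)
            # A closing bracket pairs with the most recent open one,
            # which is exactly what makes the pairs nested.
            pairs.append((open_positions.pop(), pos))
    if open_positions:
        raise ValueError("unmatched '(' in annotation")
    return sorted(pairs)


def crosses(pair_a, pair_b):
    """True if the arcs of the two base pairs cross each other."""
    (i, j), (k, l) = sorted([pair_a, pair_b])
    return i < k < j < l


def has_pseudoknot(pairs):
    """True if any two base pairs in the list cross."""
    return any(crosses(p, q)
               for index, p in enumerate(pairs)
               for q in pairs[index + 1:])


if __name__ == "__main__":
    stem = parse_brackets("((..))")
    print(stem, has_pseudoknot(stem))
    print(has_pseudoknot([(1, 5), (3, 8)]))
```

A single bracket type can only describe nested structures, which is why figure 2.3 needs a second pair of symbols for the pseudoknot.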

Chapter 3

Sequence Applications of Suffix Trees

Unlike the grammar-based methods presented in chapter 4, some methods for investigating the RNA secondary structure start with a listing of all possible base pair folding sites. Finding all these sites requires an extensive string search in the sequence, at the basis of which is a string matching algorithm.

Matching string sequences is a problem that computer programmers face on a regular basis. In the context of DNA and RNA sequencing, the string matching problem comes down to a search for common substrings in two sequences or, as examined below, a search for repetitions in a single sequence. A brute force string search is clearly going to be terribly inefficient: it would require a string comparison at every single nucleotide in the sequence, resulting in O(N^3) time complexity, N being the length of the string. A solution to this problem is to apply an efficient string matching algorithm, based on the suffix tree data structure.

This chapter discusses the use of a suffix tree based algorithm to look for repetitions and complementary reversals (biological palindromes) in a nucleotide sequence. In order to start with the general case without wobble pairing, the algorithms are developed for a DNA sequence, in which only Watson-Crick base pairing matters. An ANSI C implementation [9] of this algorithm is programmed and its performance is tested at the end of this chapter. Since the suffix tree is a relatively unknown data structure, its mechanics are explained here, going into a little more detail on the points that are usually illustrated too sparsely.

3.1 The Suffix Tree

If x_1 x_2 ... x_i ... x_n is a sequence, then x_i x_{i+1} ... x_n is called a suffix and x_1 x_2 ... x_j a prefix of that sequence, for all i, j = 1 to n. For example, for the sequence MISSISSIPPI, MISS is a prefix of that sequence and IPPI a suffix, and ISSISS is a suffix of the prefix MISSISS. In order to define the suffix tree, it is necessary to introduce the concept of a suffix trie.
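The definitions of prefix and suffix can be checked directly on the MISSISSIPPI example. The following Python lines are a minimal sketch with hypothetical helper names; the last function illustrates that every substring of a sequence is a suffix of one of its prefixes, the observation on which the suffix-based data structures of this chapter rely.

```python
def prefixes(seq):
    """All non-empty prefixes x_1...x_j of seq, for j = 1 to n."""
    return [seq[:j] for j in range(1, len(seq) + 1)]


def suffixes(seq):
    """All non-empty suffixes x_i...x_n of seq, for i = 1 to n."""
    return [seq[i:] for i in range(len(seq))]


def is_suffix_of_some_prefix(candidate, seq):
    """True if candidate is a suffix of at least one prefix of seq.

    This holds exactly when candidate is a substring of seq.
    """
    return any(prefix.endswith(candidate) for prefix in prefixes(seq))


if __name__ == "__main__":
    word = "MISSISSIPPI"
    print("MISS" in prefixes(word), "IPPI" in suffixes(word))
    print("ISSISS" in suffixes("MISSISS"))
    print(is_suffix_of_some_prefix("SSIP", word))
```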

3.1.1 Suffix Trie

Each sequence of characters can be represented by a trie, a kind of tree that contains every suffix of the sequence. Consider the sequence BOOK, with suffixes BOOK, OOK, OK and K. Figure 3.1 shows how the suffix trie is constructed. At the start there is only an empty trie, which contains only node 0 (fig. 3.1(a)). Then, beginning at node 0, the longest suffix is added, one edge and one node per character of the suffix. Each edge is labelled by one character (fig. 3.1(b)). The following suffix, OOK, is also added at node 0, but since it starts with a different character, it defines a new edge, resulting in fig. 3.1(c). The next suffix to be added, OK, has the same start character as a suffix that is already represented, starting with an edge labelled O. So this edge is followed to the next node, where a check is performed to see if it has an edge starting with K. Because this is not the case yet, a new edge labelled K is created (fig. 3.1(d)). Adding the last suffix, K, is easy, and the result is the suffix trie (fig. 3.1(e)). Thanks to this way of constructing, there will never be a node from which two edges with the same character leave. In this suffix trie, every suffix of the sequence can be found by starting at node 0 and walking down the tree.

[Figure 3.1: panels (a) to (e) not reproduced.]
Figure 3.1: Construction of the suffix trie for the sequence BOOK.

The most important characteristic of the suffix trie is that one can search for any subsequence of the word by starting at node 0 and following the matches down. If at any moment the correctly matching edge is not present or the tree just stops, the search stops, and this means that the subsequence is not part of the sequence. The strength of this method is the speed of subsequence searches.
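The construction and the search described above can be sketched with nested Python dictionaries, where each dictionary is a node and each key is the character labelling an outgoing edge. This is a naive illustration with hypothetical function names, not the ANSI C program of this project.

```python
def build_suffix_trie(seq):
    """Insert every suffix of seq, longest first, into a trie.

    A node is a dict mapping an edge character to the child node,
    so two edges with the same character can never leave one node.
    """
    root = {}
    for start in range(len(seq)):
        node = root
        for char in seq[start:]:
            # Follow the edge if it exists, otherwise create it.
            node = node.setdefault(char, {})
    return root


def contains(trie, query):
    """Walk down from the root; fail as soon as an edge is missing."""
    node = trie
    for char in query:
        if char not in node:
            return False
        node = node[char]
    return True


def count_nodes(trie):
    """Number of nodes in the trie, the root (node 0) included."""
    return 1 + sum(count_nodes(child) for child in trie.values())


if __name__ == "__main__":
    trie = build_suffix_trie("BOOK")
    print(contains(trie, "OOK"), contains(trie, "BK"), count_nodes(trie))
```

A query of length m is answered after at most m character comparisons, independent of the length of the indexed sequence.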
If the collected works of Shakespeare are written as one string sequence and its suffix trie is constructed, determining whether the word BOOK is part of it takes only four character comparisons.

Although searching in a suffix trie is very fast, it should be clear by now that constructing a suffix trie requires a lot of time (and space). Concretely, it requires O(N^2) time and space, where N is the length of the sequence. This quadratic performance makes it impossible to deal with large sequences. A way to deal with these problems is to use a suffix tree, which is based on the suffix trie.

3.1.2 Suffix Tree Definition

Suffix trees are compressed tries, which contain all suffixes of a given string sequence. To get a suffix tree, path compression is applied, a method proposed by Edward McCreight in 1976: nodes from which only one edge leaves are eliminated, so that individual edges in the tree may now represent sequences of text instead of individual characters.

[Figure 3.2: three panels not reproduced: suffix trie; suffix tree; suffix tree (with indices on edges). In the third panel, the edges BOOK, O and K leaving node 0 carry the index pairs 1,4; 2,2 and 4,4, and the edges OK and K below the O edge carry 3,4 and 4,4.]
Figure 3.2: Constructing the suffix tree out of the suffix trie.

Figure 3.2 shows how the suffix tree of the sequence BOOK can be obtained from its suffix trie in an intuitive way. The first drawing represents the suffix trie for BOOK, in which all distinct suffixes of the sequence can be found, starting from node 0, and every character in these suffixes labels one edge between the nodes. In the second part, path compression is applied, eliminating nodes with only a single leaving edge. The edges can now represent substrings of characters. All information of the suffix trie is conserved, but less memory is used, due to the elimination of nodes. The third part of figure 3.2 shows the suffix tree as it is used concretely: the substrings labelling the edges have been replaced by the corresponding start and end indices in the sequence.

McCreight's path compression led to the suffix tree data structure, eliminating a large number of nodes so that time and space complexity are reduced to O(N). This makes the suffix tree a very reasonable structure for sequence problems, requiring only a one-time pre-processing investment.
His first algorithm for constructing suffix trees had one important disadvantage: the tree had to be built in reverse order, starting with the last characters of the sequence. This ruled out on-line processing, in which the tree is constructed while new characters are still being received at the end of the string.
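As a rough illustration of the difference between the two structures, the following Python sketch builds the suffix trie of BOOK naively and then a path-compressed tree with 1-based (start, end) indices on the edges, as in the third drawing of figure 3.2. It is not McCreight's linear-time construction. The function names are chosen here for illustration only, and the compressed version assumes that the last character of the text occurs only once (true for BOOK), so that every suffix ends in a leaf.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts: one edge per character, O(N^2) nodes."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, word):
    """Walk down from the root, one character comparison per edge."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return True


def build_suffix_tree(text):
    """Path-compressed tree: {(start, end): subtree} with 1-based inclusive indices.

    Assumption of this sketch: the last character of `text` is unique,
    so no suffix ends in the middle of an edge.
    """
    def build(starts, depth):
        # Group the suffixes by the character found `depth` positions in.
        groups = {}
        for s in starts:
            if s + depth < len(text):
                groups.setdefault(text[s + depth], []).append(s)
        node = {}
        for group in groups.values():
            # Extend the edge as long as all suffixes in the group agree.
            length = 1
            while (all(s + depth + length < len(text) for s in group)
                   and len({text[s + depth + length] for s in group}) == 1):
                length += 1
            first = group[0] + depth
            node[(first + 1, first + length)] = build(group, depth + length)
        return node

    return build(list(range(len(text))), 0)


def count_nodes(node):
    """Number of nodes in either structure, the root included."""
    return 1 + sum(count_nodes(child) for child in node.values())


trie = build_suffix_trie("BOOK")
tree = build_suffix_tree("BOOK")
print(contains(trie, "OK"), count_nodes(trie), count_nodes(tree))
print(tree)
```

For BOOK the trie has 10 nodes and the compressed tree 6, and the edge labels (1,4), (2,2), (3,4), (4,4) and (4,4) are the index pairs shown in figure 3.2.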

3.1.3 Mechanics

In 1995, Esko Ukkonen proposed an efficient algorithm that allows on-line processing (see [24], or the summarized description in [13]). His algorithm starts with an empty tree (node 0) and then progressively adds each of the N prefixes of the string to the suffix tree. If an extra character is appended to the sequence, the tree can be updated by adding the next prefix. The process in which one prefix is added to the suffix tree is called a phase. In each phase, every suffix of the current prefix is added to the tree, starting with the longest suffix and working down to the shortest one, which is the empty string. This way every substring of the sequence ends up in the tree: an arbitrary substring x_i...x_j (i ≤ j) of the sequence is dealt with when the suffix x_i...x_j of the prefix x_1...x_j is added.

[Figure 3.3: The four phases in the suffix tree construction for the sequence BOO. Phase 1: empty string; phase 2: B; phase 3: BO; phase 4: BOO.]

In figure 3.3, an example for the sequence BOO is worked out. It shows the four phases of the construction of its suffix tree. Note that these could also be the first four phases in constructing the suffix tree of a longer sequence, e.g. BOOKKEEPER. In the first phase, the prefix to add to the tree is the empty string, resulting in a tree that contains only node 0. The first non-empty prefix to add to the tree is B, in phase 2. Adding prefix B means adding all its suffixes, starting with the longest, B, and ending with the shortest, the empty string. To add suffix B, a new edge, labelled B, and a new node, labelled with its number, 1, are created. To add the empty string, nothing needs to change because it is already contained in the tree. In phase 3, prefix BO is added, which consists of adding the suffixes BO, O and the empty string.
This is done in the same way as in the suffix trie, but now path compression is applied, resulting in a single edge representing BO. For O, a new edge is created, because no edge starting with O was leaving node 0 yet. Adding prefix BOO in phase 4 is done in exactly the same way.
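The order in which prefixes and suffixes are visited can be written down in a few lines. The Python sketch below only enumerates the work of each phase and checks that every substring is covered; it does not build a tree, and the function names are illustrative assumptions rather than part of Ukkonen's algorithm.

```python
def phases(text):
    """Yield (phase number, prefix, suffixes of that prefix, longest first)."""
    for j in range(len(text) + 1):
        prefix = text[:j]
        suffixes = [prefix[i:] for i in range(j + 1)]
        yield j + 1, prefix, suffixes


def all_substrings(text):
    """Every substring of `text`, the empty string included."""
    n = len(text)
    return {text[i:j] for i in range(n + 1) for j in range(i, n + 1)}


covered = set()
for number, prefix, suffixes in phases("BOO"):
    print(f"phase {number}: prefix {prefix!r}, suffixes {suffixes}")
    covered.update(suffixes)

# Each substring x_i...x_j shows up as a suffix of the prefix x_1...x_j.
print(covered == all_substrings("BOO"))
```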

Active Point

Until now, updating the tree was easy. The only types of update were the creation of a new edge from node 0 and the extension of existing edges. Suppose now that the suffix tree for the sequence BOOKKEEPER is to be constructed. While the tree already constructed for BOO is updated further with the remaining prefixes of BOOKKEEPER, a third type of update is needed. An example of this update is illustrated in figure 3.4. At one point in the construction, the tree of figure 3.4 is reached, representing the tree for BOOKK. First of all, one can see that this tree contains every suffix of every prefix of BOOKK, starting at node 0. Each suffix ends at a node of one of these three types:

Leaf node: a node from which no edges leave, e.g. node 2.
Explicit node: a non-leaf node at a point in the tree where two or more edges part ways, e.g. node 3.
Implicit node: a position in the middle of an edge. These are nodes that appeared in the suffix trie, but because of the path compression they are no longer nodes in the suffix tree, e.g. the position in the middle of KK, between node 0 and node 5.

[Figure 3.4: Suffix tree for BOOKK (phase 6).]

Updating the tree means visiting each of the suffixes in the existing tree and adding the next character of the sequence to the end of the suffix. This can be done in three ways:

1. Adding a new edge with a new node to node 0.
2. Simple extension of an edge.
3. Creation of a new node by turning an implicit node into an explicit node and adding an edge with a leaf node attached.
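The node types and the three kinds of update can be made concrete with a brute-force sketch. The Python code below does not build a tree at all. Instead, it derives from substring occurrences where each suffix of a prefix would end, and which update the next character would trigger. The helper names `end_type` and `update_type` are assumptions made for this illustration, and the approach is far slower than the real algorithm.

```python
def end_type(prefix, suffix):
    """Type of the node where `suffix` ends in the suffix tree of `prefix`."""
    if suffix == "":
        return "root"                      # the empty suffix ends at node 0
    occurrences = 0
    followers = set()                      # characters seen right after `suffix`
    pos = prefix.find(suffix)
    while pos != -1:
        occurrences += 1
        nxt = pos + len(suffix)
        if nxt < len(prefix):
            followers.add(prefix[nxt])
        pos = prefix.find(suffix, pos + 1)
    if occurrences == 1:
        return "leaf"                      # occurs only at the end of the prefix
    return "explicit" if len(followers) > 1 else "implicit"


def update_type(prefix, suffix, ch):
    """Update needed for `suffix` when character `ch` is appended to `prefix`."""
    kind = end_type(prefix, suffix)
    if kind == "leaf":
        return "extend edge"
    if suffix + ch in prefix:
        return "nothing to do"             # suffix + ch is already in the tree
    if kind == "implicit":
        return "split edge, add leaf"
    return "add leaf edge"                 # at node 0 or another explicit node


prefix, new_char = "BOOKK", "E"
for i in range(len(prefix) + 1):
    suffix = prefix[i:]
    print(repr(suffix), end_type(prefix, suffix), "->",
          update_type(prefix, suffix, new_char))
```

For BOOKK followed by E, the four longest suffixes end in leaves and are simply extended, the suffix K ends in an implicit node that has to be split, and the empty suffix receives a new leaf edge at node 0.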

The third type of extension happens for example when the tree for the prefix BOOKK is updated with a new prefix, BOOKKE. The suffix K of BOOKK ends in an implicit node part way down the edge labelled KK, between node 0 and node 5 (see fig. 3.4). When updating a suffix tree, the active point is defined as the position in the prefix where the first suffix starts that doesn't terminate at a leaf node. In this case the active point can be represented as the vertical line in BOOK|K, because the suffix K is the first one encountered in BOOKK that doesn't end at a leaf node. This active point thus defines the suffix K, and its length is said to be the length of the suffix starting at that point (this also explains what is meant by suffixes that are longer or shorter than the active point, as used further on in this text). It corresponds to the implicit node in the middle of the KK branch of the suffix tree.

[Figure 3.5: Suffix trees for BOOKK and BOOKKE. (a) phase 6: BOOKK; (b) phase 7: BOOKKE.]

The prefix to add to the tree is BOOKKE. Updating the tree means visiting each of the suffixes in the existing tree and adding the next character, E, to the end of the suffix (see fig. 3.5(a)). At this point there are two kinds of suffixes. First there are the ones that ended in a leaf node when the previous prefix was dealt with. Updating these suffixes is easy: because of the path compression, it suffices to append E to the string on the edge that ends in that leaf. These are the suffixes that are dealt with first. Then a suffix that didn't end in a leaf node, K, is reached. This one starts at the active point. To update it, and all the smaller suffixes (in this case only the empty string), the node it ended in is traced, followed by a check whether an edge starting with the character to add already leaves that node. It turns out that K ended in an implicit node.
Since none of the edges leaving that node begins with the character E (there's only one edge, which starts with K), this node is converted into an explicit node, and an edge labelled E with a new leaf node is added (fig. 3.5(b)). The same process is repeated for all smaller suffixes, to make sure they're in the tree. In this case the only smaller suffix is the empty string, corresponding to node 0, and since there's no edge starting with E leaving this node yet, a new edge with a new leaf node is added.

Basics Of The Algorithm

The suffix tree has some characteristics that allow for a fairly efficient algorithm. The first important trait is this: once a leaf node, always a leaf node. Any node that is created as a leaf will never be given a descendant; its edge will only be extended through character concatenation. More importantly, every time a new suffix is added to the tree, the edges leading into every leaf node are automatically extended by a single character. That character is the last character of the new suffix. This makes the management of the edges leading into leaf nodes easy. Any time a new leaf node is created, its edge is automatically set to represent all the characters from its starting point to the end of the input text. Even if those characters are not known yet, it's certain they will be added to the tree eventually. Because of this, once a leaf node is created, it can simply be forgotten about. If the edge is split later on, its starting point may change, but it will still extend all the way to the end of the input text. This means the only necessary updates concern the explicit and implicit nodes at the active point (which defines the first suffix that doesn't end in a leaf node). Given this, only the suffixes from the active point down to the empty string would have to be considered, testing each node for update eligibility. However, some time can be saved by stopping the update earlier. When walking through the suffixes, a new edge is added to each node that doesn't have a descendant edge starting with the correct character. When finally a suffix is reached whose node already has the correct character as a descendant, the update can stop, because all smaller suffixes were updated in the same way in a previous phase.
The obvious conclusion is that, if a certain character is found as a descendant of a particular suffix, it's bound to be a descendant of every smaller suffix as well.

End Point

When adding a new prefix to the tree, the end point is the position where the first matching descendant is found, i.e. the first suffix whose node already has a descendant starting with the new character. Every suffix equal to or shorter than this one is already contained in the tree, meaning that these suffixes mark repetitions in the sequence, a useful fact for the next section, where the algorithm will have to look for repetitions. The end point has an important extra feature that makes it particularly useful. Since leaf nodes are added to every suffix between the active point and the end point, every suffix longer than the end point will end in a leaf node after the update. This means that the end point will turn into the active point on the next pass over the tree.
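The relation between active point and end point can be checked by brute force. In the Python sketch below, both points are represented by the 0-based start index of the suffix they define. The active point is found as the longest suffix that also occurs earlier in the prefix, and the end point as the first suffix that, extended with the new character, is already a substring of the old prefix. The function names are illustrative assumptions, and the quadratic search is only a stand-in for what the real algorithm finds by walking the tree.

```python
def active_start(prefix):
    """Start index of the longest suffix of `prefix` that does not end in a leaf,
    i.e. that also occurs somewhere earlier in `prefix`."""
    for i in range(len(prefix)):
        if prefix.find(prefix[i:]) < i:
            return i
    return len(prefix)                     # only the empty suffix is left


def end_start(prefix, ch):
    """Start index (in prefix + ch) of the first suffix already in the tree."""
    for i in range(active_start(prefix), len(prefix) + 1):
        if prefix[i:] + ch in prefix:
            return i
    return len(prefix) + 1                 # only the empty suffix matched


# Example from the text: BOOKK, active point at BOOK|K, then E is added.
print(active_start("BOOKK"), end_start("BOOKK", "E"))

# Sequence of figure 3.6, with the final A as the new character.
dna = "ACTGATTGGCTGGCTGGCTG"
print(active_start(dna), end_start(dna, "A"), active_start(dna + "A"))
```

With 0-based indices, the sequence of figure 3.6 gives an active point at index 10 (suffix TGGCTGGCTG) and an end point at index 17 (suffix CTGA), which is also the active point of the next phase.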

[Figure 3.6: Active and end point divide the prefix in three zones. The figure marks the active point and the end point in the sequence ACTGATTGGCTGGCTGGCTGA.]

Example: Active and End Point

With the introduction of the end point, the prefix can now be divided in three zones (see fig. 3.6). Suppose the suffix tree for the sequence ACTGATTGGCTGGCTGGCTGA (a quasi-random sequence that contains a few repetitions) is being constructed, and the tree corresponding to the sequence ACTGATTGGCTGGCTGGCTG is already done. In that case the active point and the end point are both found at the same location, as follows: ACTGATTGGC|TGGCTGGCTG. If the following prefix is now added to the tree, the current active point takes the position of the previous end point (in this case its position stays the same), and the current end point is set to ACTGATTGGCTGGCTGG|CTGA, since the first suffix that is already contained in the tree is CTGA. This divides the prefix in three zones:

zone 1: Start characters of suffixes that end in leaf nodes and only need their last edge extended with the new character A.
zone 2: Start characters of suffixes that are not yet contained in the tree and will therefore cause the creation of new edges and leaf nodes.
zone 3: Start characters of suffixes that are already contained in the tree. These suffixes end at non-leaf nodes and need no update. All suffixes starting in zone 3 mark repetitions in the sequence.

By confining the updates to the suffixes of zone 2, far less work is required to update the tree. And by keeping track of the end point, the position of the active point in the following update is automatically known.

Suffix Pointer

When navigating through the tree, one operation that has to be implemented efficiently is finding the node corresponding to the next smaller suffix. If this is done simply by walking down the tree until the correct node is found, the algorithm isn't going to run in linear time. To get around this, a so-called suffix pointer (see fig. 3.7) is introduced.
This is a pointer found at each internal node, which points to the node that is the first suffix of that string¹. So if a particular node represents a string containing characters 1 through N of the input text, the suffix pointer for that node will point to the node that is the termination point for the string starting at the root that represents characters 2 through N of the input text.

[Figure 3.7: The suffix pointers in the suffix trees for the string ABABABC, represented as arrows. Each suffix pointer starts at an explicit node and points from one suffix to the next one. The suffix pointer at node 4, for example, points from suffix ABAB to suffix BAB.]

The suffix pointers are built while the tree is being updated. When moving from the active point to the end point, the father node of each of the newly created leaves is remembered. Every creation of a new edge goes together with the creation of a suffix pointer from the father node of the last created leaf edge to the current father node. Obviously, this can't be done for the first edge created in the update (since no previous leaf edges were added in the current phase), so for this one walking through the tree will still be required.

¹ In practice, only the suffix pointers of explicit nodes are kept track of. For implicit nodes, suffix pointer lookup happens by tracing the father node's suffix pointer. The correct suffix is then found by walking down the correct edge, starting at the node that the suffix pointer points to. For example, determining the suffix pointer for BABA in figure 3.7 consists of tracing its father node, 6, following the suffix pointer that starts there, to node 1, and walking down the correct edge (towards node 4) until ABA is encountered.

Source code

Since code for string searches in sequences using Ukkonen's algorithm is available as open source on the internet, there was no need to rewrite this part. The program presented here to look for repetitions and complementary reversals is based on the ANSI C implementation of the suffix tree written by Dotan Tsadok for his undergraduate project at Haifa University, Israel, in August. In this program, the Ukkonen algorithm is implemented as described in [7]. Updating the tree with one prefix (one phase) is done by running the function SPA, the single phase algorithm. For this prefix, every suffix is added to the tree with the function SEA, the single extension algorithm. In pseudo code, this looks like:

    SEA(current_suffix) {
        test_char = last_char in new_suffix;
        follow current node's suffix link;
        if (suffix link ends at an explicit node) {
            if (the node has no descendant edge starting with test_char)
                create new leaf edge starting at the explicit node;
            else
                current phase is done;
        } else {
            if (the implicit node's next char isn't test_char) {
                split the edge at the implicit node;
                create new leaf edge starting at the split in the edge;
            } else
                current phase is done;
        }
    }

Whenever a new node is created, the function SEA is called again for the next suffix. When the first suffix is reached that is already contained in the tree, the end point is set there, and the next prefix is dealt with (next phase).

3.2 Algorithm for Finding Repetitions

Suppose now that repetitions in the sequence are important, and the algorithm is to be extended to look for them. Constructing the suffix tree first would discard any information about repetitions, since a repetition in the sequence is actually a suffix that is equal to or shorter than the end point, and these suffixes are ignored because they are already part of the tree. To detect repetitions, the algorithm is changed so that, in every phase, it looks at what happens behind the end point. Making this extension time-efficient is not obvious, because the strength of the Ukkonen algorithm is precisely that it ignores the suffixes behind the end point. Visiting the suffixes is done by following suffix pointers. The point at which a matching descendant is found is set as the end point. The suffix at the current end point is stored as a repetition, and the earlier occurrence it is equal to can be found as the string that starts at node 0 and ends in the node corresponding to the end point. All suffixes smaller than the end point are also repetitions. These suffixes can be found by following the remaining suffix pointers, and they are equal to a previous suffix that starts at node 0 and ends in
