RNA Secondary Structure Prediction by Stochastic Context-Free Grammars

Faculty of Applied Sciences
Department of Electronics and Information Systems
Head of the Department: Prof. Dr. Eng. J. Van Campenhout

RNA Secondary Structure Prediction by Stochastic Context-Free Grammars
by Steven Van Vaerenbergh

Coordinator: Prof. Dr. Eng. J.-P. Martens, Ghent University
Instructor: Associate Prof. L. Vielva, University of Cantabria, Spain

Dissertation submitted in order to obtain the academic degree of Electrical Engineer
Academic year

Summary

The function of most types of RNA molecules is determined by their structure, which in turn is determined by the linear RNA sequence. Predicting the secondary structure of an RNA molecule from its linear base sequence is a challenge in bioinformatics, with applications in medical sciences, biology and the study of phylogenetic history. In this study, the known methods of RNA secondary structure prediction are reviewed first. Then a number of algorithms and programs are developed as tools to apply to the problem. An existing algorithm for finding substrings in large strings using suffix trees is extended to an algorithm that lists repetitions and biological palindromes in DNA or RNA sequences; it is implemented in ANSI C. A basic hidden Markov model is programmed in Matlab and then extended to the more general model of stochastic context-free grammars. Algorithms for this model are implemented for grammars in Chomsky normal form. Next, stochastic context-free grammars are described specifically for RNA modelling. At the end of the project, an attempt is made to develop a new prediction approach. Two probabilistic models are constructed that take into account the RNA molecule features used in the existing thermodynamic approach of the Zuker algorithm. Extensions of the previous probabilistic algorithms are programmed for these two specific cases, the complete models are trained with sequences from RNA databases, and their prediction accuracy is tested on unknown sequences. The results point to possible model improvements, and a list of refinements is given at the end of this report.

Keywords: RNA, secondary structure, hidden Markov model, stochastic context-free grammar, suffix tree, bioinformatics.

The author gives his permission to make this dissertation available for consultation and to copy parts of it for personal use. Any other use is subject to the restrictions of copyright, in particular the obligation to cite the source explicitly when reporting results from this dissertation.

Steven Van Vaerenbergh
June

Acknowledgements

First of all, I wish to thank my project instructor Luis Vielva, for his enthusiastic support and his suggestions for different approaches to the problems encountered, and my coordinator Jean-Pierre Martens, for supervising this final year project. Helpful advice on some biological issues in this study was provided by Fernando de la Cruz, professor of Molecular Biology at the University of Cantabria. I also wish to thank my parents, for supporting me and giving me the opportunity to finish my studies, and my girlfriend Angela, for cheering me up when the experimental results in this investigation did the opposite.

"It must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any interpretation of this term."
Noam Chomsky, famous linguist, on the probabilistic approach to handling grammars

"Every time I fire a linguist, my system's performance improves."
Fred Jelinek, former head of the IBM speech recognition group, on the statistical language recognizer¹

¹ Credit for pairing these quotes goes to D. Jurafsky and J. Martin [8].

Contents

1 Introduction

2 Molecular Genetics and the Sequencing Evolution
  Unravelling the Genome
  DNA
  RNA
  Proteins
  Sequence Databases
  Mapping and Sequencing the Human Genome
  Database Search
  Sequence Similarity
  Multiple Alignments
  Biological View

3 Sequence Applications of Suffix Trees
  The Suffix Tree
  Suffix Trie
  Suffix Tree Definition
  Mechanics
  Basics Of The Algorithm
  Example:
  Active and End Point
  Suffix Pointer
  Source code
  Algorithm for Finding Repetitions
  The Repetition List
  Restrictions on repetitions
  Algorithm for Finding Complementary Reversals
  Complementary Reversals (CRs)
  Algorithm: the Concatenation Approach
  Program Results
  Application to RNA
  Suffix Trees for Very Large Sequences
  Conclusions

4 Stochastic Context-Free Grammars
  Hidden Markov Models
  Example
  Elements of an HMM
  Definition of an HMM
  Three Basic Problems and Solutions
  Transformational Grammars
  Linguistics
  Definition
  Example
  Parsing
  Chomsky Hierarchy of Transformational Grammars
  Regular Grammars
  Context-Free Grammars
  Stochastic Grammars
  Sequence Modelling with SCFGs
  The inside algorithm
  The outside algorithm
  Parameter re-estimation
  The CYK algorithm
  Implementation of the Chomsky normal form SCFG
  Sequence Generation by the Model Itself
  Implementation
  Application of the model in Chomsky Normal Form

5 RNA Secondary Structure Prediction using SCFGs
  Terminology
  Simple RNA Secondary Structure Prediction
  The Nussinov Algorithm for Base Pair Maximization
  A First Nussinov-based SCFG algorithm
  Use of the Nussinov algorithm
  A General RNA SCFG Model

6 Obtaining the SCFG Probabilities
  Model Choice
  Nonterminals Choice
  Algorithms choice
  Training Sequences: tRNA
  The First Model
  Model suppositions
  Parse tree suppositions
  Implementation
  Results
  6.3.5 Interpretation
  Model Estimation by Inside-Outside Training
  The Second Model
  Model Suppositions
  Parse Tree Suppositions
  Implementation
  Results and Interpretation
  Further improvements
  Conclusions

7 Conclusions
  Project overview
  Achieved Goals
  Future Research Guidelines

A Summary of existing RNA secondary structure prediction methods
  A.1 The Zuker Folding Algorithm: Energy Minimisation
  A.1.1 The mfold Program
  A.1.2 Suboptimal RNA Folding
  A.2 Covariance Models: SCFG-based RNA profiles
  A.2.1 Performance
  A.3 SCFGs for Homologous RNA Sequences using Tree Grammar EM
  A.4 Attempts to model pseudoknots
  A.5 More Approaches
  A.6 Additional Data
  A.6.1 Bulge Loop Distribution Counts

B Notes on the HMM Implementation

C Source Code Overview

D An Extract of the Results
  D.1 First RNA SCFG Model

Chapter 1

Introduction

With the mapping of the human genome, an enormous amount of biological information has become available. To analyse it efficiently and to extract the useful data, many manual methods are being replaced by engineering approaches, giving rise to the science called bioinformatics. One of the challenges in bioinformatics is the prediction of RNA secondary structure, a complex problem whose results have applications in medical sciences, biology and the study of phylogenetic evolution.

At the start of this investigation the goals were not completely specified, owing to the experimental nature of this research field. They would depend on the progress and the intermediate results. One thing that was clear from the beginning, however, was that developing a completely new and functioning algorithm for RNA secondary structure prediction is a task that far exceeds the time available in a single final year project. The goals were therefore restricted to a preparatory investigation. These goals are threefold:

- a study of the key areas (biology, probability theory, dynamic programming and formal language theory) and of the known methods for RNA secondary structure prediction;
- the development of algorithms and programs that can be used as tools to facilitate the development of a prediction program;
- if time allows, a draft version of a model that predicts simplified RNA structures.

A discussion of biosequence analysis requires a background in several key areas. After this first chapter, which consists of the project description, chapter 2 gives an introduction to biology, the first key area, along with the arguments why knowledge of RNA secondary structure is so important.

In chapter 3, an algorithm for listing repetitions and biological palindromes in DNA or RNA sequences is developed. It can be applied in, for instance, algorithms for multiple alignments of biological sequences, or structure prediction algorithms that start with a complete listing of all possible strand folding sites. The presented algorithm is based on the suffix tree structure, which allows efficient searches for string matches.

When DNA sequences are analysed, they are often described as generated by a hidden Markov model. This is a probabilistic model that can be seen in the larger context of transformational grammars, and can subsequently be extended to the model of a stochastic context-free grammar. The latter has fewer restrictions than the hidden Markov model and is appropriate for modelling the specific characteristics of RNA secondary structure. All the terms mentioned here are explained in chapter 4, along with a description of the accompanying basic problems and the algorithms that solve them.

The dynamic programming algorithm for general structure prediction from chapter 4 can be written in a specific form for RNA sequences, reducing its time and memory complexity. This algorithm is presented in chapter 5. It is then transformed into a probabilistic version.

In chapter 6, two probabilistic models for RNA secondary structure prediction are designed and implemented. They are based on an extension of the algorithm from the previous chapter, and use characteristics borrowed from a thermodynamic structure prediction approach, the Zuker algorithm. They are trained with sequences from tRNA databases, after which their prediction accuracy is tested on sequences with unknown structures.

The report concludes in chapter 7 with a list of improvements to the presented models and programs, and with possible approaches for continuing the development of a complete RNA secondary structure prediction algorithm.

The experimental nature of the research in this area was confirmed several times: new methods were tried that later had to be abandoned for lack of useful results. These methods are not included in the report.
Wherever possible, the explanations in this report are limited to the reasoning behind the programmed methods, their results and the conclusions. Apart from some algorithms in pseudocode, program source code is not included. It can be found on the accompanying CD-ROM, together with additional information on the implementation.

Some parts of the report, such as the explanation of suffix trees, are extensively illustrated with diagrams. These are meant to help in understanding the long text descriptions (and were moreover requested by my instructor). Also, some parts covering existing theory are adopted from textbooks or articles, where they describe relatively unknown matters that are necessary for a good understanding of this investigation.

This dissertation was written in English since it is the result of a final year project carried out at the University of Cantabria in Spain, where I studied as a participant in the Erasmus exchange programme.

A last introductory note is that the macro package LaTeX 2ε was chosen for typesetting, because of its predefined, professional layout, its capacity for typesetting mathematical formulae and the ease with which it structures documents [15].

Chapter 2

Molecular Genetics and the Sequencing Evolution

The complete set of instructions for making an organism is called its genome. It contains the master blueprint for all cellular structures and activities for the lifetime of the cell or organism. In the last few years, both technological innovation and the realization that genomic sequencing is fundamental to the study of life on earth have greatly encouraged efforts to obtain the entire genomic sequences of several organisms. Because of those new technologies, science can start to apply engineering methods to the study of biology. This chapter gives an overview of the biological background of the structures analysed in this work.

2.1 Unravelling the Genome

Found in every one of an organism's many trillions of cells, the genome consists of tightly coiled threads of DNA and associated protein molecules, organized in structures called chromosomes (see fig. 2.1). The human genome contains 24 distinct chromosomes, in which 3 billion DNA characters are organized.

2.1.1 DNA

Cells are the fundamental working units of every living system. All the instructions needed to direct their activities are contained within the DNA (deoxyribonucleic acid). DNA from all organisms is made up of the same chemical and physical components. It is a macromolecule of two strands wrapped around each other as a double helix (see fig. 2.1). Each of the strands contains repeating similar units called nucleotides, each characterized by one of four different bases: adenine (A), cytosine (C), guanine (G) and thymine (T). Nucleotides are strung together by covalent bonds linking the 3' hydroxyl group of one nucleotide to the phosphate group at the 5' carbon of the sugar of the next nucleotide. Thus DNA strands form linear chains and have a direction specified by the 5' and 3' ends¹.

Figure 2.1: The different levels in the genome. On the highest level, the genome consists of a set of chromosomes. On the lowest level, there is the double DNA strand, in which each strand consists of a sequence of nucleotides, each characterized by a base. (Figure copyright US National Human Genome Research Institute)

The two strands are held together by weak bonds between the bases, where A pairs with T, and C pairs with G, forming the so-called Watson-Crick base pairs. Basically, this complementary pairing means the two strands contain the same information. Exactly fifty years ago, Watson and Crick showed that the base-pairing nature of DNA allows this genetic information to be carried accurately [27]. Two mechanisms can be explained by this knowledge. The first is DNA replication: during cell division the two chains split and each serves as a template for the complementary bases, forming two new double-stranded DNA chains (one for each daughter cell). The second is gene expression: one section of a chain, a gene, collects complementary bases and creates a single-stranded linear molecule that codes for a protein. This is the RNA molecule.

2.1.2 RNA

In eukaryotic cells, DNA is located inside the nucleus, whereas protein synthesis occurs in the cytoplasm, outside the nucleus. There must therefore be another information-carrying molecule that can transfer the genetic information from the DNA inside the nucleus to the protein synthesis site in the cytoplasm. This is the function of the RNA (ribonucleic acid) molecule. It is chemically very similar to DNA. It also consists of a long chain of nucleotides, and the main differences with DNA are that RNA is single-stranded and that thymine is replaced by a similar base, uracil (U).

As said, this type of RNA codes for a protein. It is called messenger RNA (mRNA), and it moves to the ribosomes, where translation occurs by means of another RNA type, transfer RNA (tRNA). The tRNA treats the single chain of bases in the mRNA as divided into groups of three bases (codons), and for each triplet it selects the corresponding amino acid (out of the twenty existing amino acids). These amino acids are then assembled into molecules known as proteins. Other types of RNA also exist, like ribosomal RNA (rRNA).
Some RNAs are necessary in, for example, the functioning of viruses.

RNA Secondary Structure

RNA is one of nature's most complex machines. It can carry information in the sequence of bases and it can perform certain functions based on how it folds in space, which is dictated by the sequence of bases. Whereas the DNA strands are entwined as a double helix, the single RNA strand folds in space and interacts with itself, resulting in a two-dimensional structure known as the RNA secondary structure. Because of the chemical nature of the bases, base pairs can form (creating stems in the molecule). A base pair is weaker than the covalent bonds that string the sequence together, and therefore the ordering of the bases in the original sequence remains invariant regardless of the base pairs that form. The most common base pairs are Watson-Crick pairs (A pairs with U and C pairs with G), but in nature wobble pairs (G pairs with U) also occur.

¹ Most DNA processing occurs in a linear manner, starting from the 5' end and proceeding to the 3' end.

The secondary structure of RNA is usually characterized by specifying which bases interact. When base pairs form, the sequence is divided into stems, which are the base-paired regions, and loops, which are the unpaired regions.

Figure 2.2: An example of RNA secondary structure: part of the small subunit ribosomal RNA molecule of Tetrahymena bergeri. Bars mark the base-paired stems.

If wobble pairs are disregarded for a moment, the base sequences that form stems in the secondary structure are complementary in the Watson-Crick sense and run in opposite directions (see fig. 2.2). Such sequences constitute the so-called biological palindromes² in the sequence, or complementary reversals, as they are called in chapter 3.

Base pairs almost always occur in a nested fashion in RNA secondary structure. This means that if arcs are drawn over an RNA sequence connecting the base pairs, none of the arcs cross each other. Formally, a base pair between positions i and j and a base pair between positions i' and j' are nested if and only if i < i' < j' < j or i' < i < j < j'. When non-nested base pairs occur, they are called pseudoknots (see fig. 2.3). A pseudoknot is formed when bases that are enclosed between the two parts of a stem form a new stem with bases from another part of the sequence. Pseudoknotted RNA structures occur in virtually all classes of RNA and are involved in a number of important functions.

Functionally related RNAs often have the same secondary structure, while their sequence similarity has largely vanished throughout evolution.
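The nesting condition defined in this section can be checked mechanically. The following Python sketch is only an illustration and is not part of the programs developed in this project: the function names and the example positions are hypothetical, and base pairs are assumed to be given as tuples of 1-based sequence positions. Two arcs cross when one pair opens inside the other and closes outside it; pairs lying side by side are treated as non-crossing.

```python
from itertools import combinations


def pairs_cross(p, q):
    """Return True if the arcs of base pairs p and q cross each other."""
    # Order each pair as (opening, closing) and put the pair that opens first in front.
    (i, j), (k, l) = sorted([tuple(sorted(p)), tuple(sorted(q))])
    # The arcs cross when the second pair opens inside the first and closes outside it.
    return i < k < j < l


def has_pseudoknot(pairs):
    """Return True if any two base pairs in the list are non-nested (crossing)."""
    return any(pairs_cross(p, q) for p, q in combinations(pairs, 2))


# Hypothetical examples: a simple stem, and a stem crossed by a second stem.
nested_stem = [(1, 10), (2, 9), (3, 8)]
knotted = [(1, 10), (2, 9), (5, 15), (6, 14)]

print(has_pseudoknot(nested_stem))  # False
print(has_pseudoknot(knotted))      # True
```

The check compares all pairs of arcs, so it is quadratic in the number of base pairs, which is sufficient for illustrating the definition.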
One possibility is that such functionally related RNAs have descended from a common ancestor. This makes knowledge of the secondary structure an important resource for evolutionary research. Moreover, RNA molecules that fulfil similar functions in different organisms tend to conserve their secondary structure rather than their linear sequence, which has been mutating throughout evolutionary history. These reasons of evolutionary history and relationship underline the great importance of knowing the secondary structure of an RNA molecule [5].

² A palindrome is a word or sentence that reads the same forwards as backwards, like "Doc, note. I dissent. A fast never prevents a fatness. I diet on cod.", credited to Peter Hilton, a member of the British cryptography team that cracked the German Enigma code in World War II.

Figure 2.3: (a) A representation of a pseudoknot. At the 5' end starts a stem that encloses bases that pair with other bases at the 3' end. Bars mark the Watson-Crick base pairs, the dot marks a wobble pair. (b) Characterization of the secondary structure by marking the normal stem with ( and ) (and * for the wobble pair), and the pseudoknot with [ and ]. Note that there is some ambiguity in this example: there is one normal stem and one pseudoknot, but one is free to choose which is which. In most cases it is quite clear which one is the pseudoknot, because it spans many normal stems. This example is an RNA inhibitor of the human immunodeficiency virus reverse transcriptase (Tuerk, MacDougal & Gold 1992).

2.1.3 Proteins

A protein is a polypeptide chain, built from the twenty amino acids (each encoded by codons, triplets of bases), that spontaneously folds into a well-defined three-dimensional structure. Proteins play many roles in an organism, and many different types exist. Proteins make up much of the structure of organisms. They also make muscle movement possible, thanks to a type of protein that can contract. Proteins that speed up chemical reactions (without being destroyed) are called enzymes. Many proteins serve as messengers, either between different parts of the cell or between cells, or help to turn genes on or off depending on the cell's environment.
Proteins also take part in active transport, such as pumping materials into and out of cells or between cells. Finally, they help in receiving information from the environment, such as chemical or other signals a cell receives. This is done by receptor proteins, which enable the cell to recognize the information so that it can react to it.

Understanding how genes function will require analyses of the three-dimensional structures of the proteins for which the genes code (summarized in the scheme of fig. 2.4). Unfortunately, while the entire information for a protein's 3-D structure appears to reside in the primary sequence, attempts to predict the structure, and hence the function, from sequence alone have been unsuccessful. Since the medical and biological interest in this problem is enormous, protein structure prediction is a highly active research area.

sequence → structure → function

Figure 2.4: The fundamental idea on which protein structure prediction research is based. The arrows stand for "determines".

2.2 Sequence Databases

2.2.1 Mapping and Sequencing the Human Genome

Public databases contain the complete nucleotide sequence of the human genome and those of selected model organisms. Four major databases store nucleotide sequences: GenBank (maintained by the US National Center for Biotechnology Information, NCBI) and the Genome Sequence DataBase (GSDB) in the United States, the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database in the United Kingdom, and the DNA Database of Japan (DDBJ). The databases collaborate to share sequences, which are compiled from direct author submissions and journal scans.

2.2.2 Database Search

The wealth of sequence data has made fast and efficient search algorithms necessary. Database search involves finding sequences in a database that are by some measure similar to a model that represents what is being sought. Sometimes this model is a single sequence combined with a simplistic model of evolution, as is the case with the popular tool BLAST (Basic Local Alignment Search Tool). This is a set of similarity programs designed to explore all of the available sequence databases. The programs have been designed for speed, with a minimal sacrifice of sensitivity to distant relationships with the query sequence.
Another popular search method uses FASTA. FASTA is a collection of search programs, distributed by Dr. W. R. Pearson of the University of Virginia. These programs use the FASTA algorithm, which assigns scores to sequences depending on the number of identities with the query sequence. Other models can involve a set of sequences and a probabilistic model, or an abstracted model of physics and chemistry. Discrimination is at the basis of all these models: starting with the complete list of database entries, the programs exclude sequences according to their algorithms until only the sequences similar to the query are left. Probabilistic machine learning and Bayes decision theory address this issue of discrimination in a formal manner.

2.3 Sequence Similarity

2.3.1 Multiple Alignments

A multiple alignment is a tool used in biology to show the correspondence between a set of sequences [26]. This correspondence is shown by aligning in columns the portions of the sequences that are similar (see fig. 2.5). The multiple alignment is important because it allows sequences to be viewed structurally despite the mutations caused by evolution.

   ((((((( (((( )))) ((((( ))))) ((((( ))))))))))))
1. -AUUUAUAUAGUUUAAUA------AAAACAUUACAUUUUCAUUGUAAAA A-UAAAAUUUUUAU-AUUUUUAUAAAUU
2. AAGGAGUUAGUUAAA---AU---AUAACAUUAGAAUGUCAAUCUAAAA U-AACUA--AAAA---UAGUACACCUUG
3. GCGGGUAUAGUUUAGU--GGU--AAAACCUUAGCCUUCCAAGCUAACG A-UGCGGGUUCGAUUCCCGCUACCCGCU
4. UUCUUAAUAGCUUAGU--GGUU-AAAGCAUUCGGCUGUUAACCGAAAU A-CACUAGUUCAAUUCUAGUUUAAGAAG
5. AAAUCUAUAAUUUAAU--GGAU-AAAAUAAAAACCUUCUAAGUUUUAU A-UGUAAGUUCAAAUCUUACUAGAUUUA
6. GCUUGCUUAACUCAAUC-GGU--AGAGUAUCGGUUUUGUAAACCGAAG GUUAUCGGUUCAACUCCGAUAGCAAGCU
7. UGCGCGGUAGGAGAGU--GGA--ACUCCGACGGGCUCAUAACCCGUAG GUCCCAGGAUCGAAACCUGGCCGCGCAA---

Figure 2.5: A multiple alignment of seven tRNA sequences from the EMBL Data Library. Abbreviations are ( and ) for base-paired columns and - for deletions by skip productions.

2.3.2 Biological View

As stated earlier, the similarity reflected by the secondary structure groups together different objects that have descended from a common ancestor. This concept of a common ancestor is known as homology. Homologous molecules usually share a common function.

Chapter 3

Sequence Applications of Suffix Trees

Unlike the grammar-based methods presented in chapter 4, some methods for investigating the RNA secondary structure start with a listing of all possible base pair folding sites. Finding all these sites requires an extensive string search in the sequence, at the basis of which lies a string matching algorithm.

Matching string sequences is a problem that computer programmers face on a regular basis. In the context of DNA and RNA sequencing, the string matching problem comes down to a search for common substrings in two sequences or, as examined below, a search for repetitions in a single sequence. A brute force string search is clearly going to be terribly inefficient here. This type of search requires a string comparison at every single nucleotide in the sequence, resulting in O(N^3) time complexity, N being the length of the string. A solution to this problem is to apply an efficient string matching algorithm, based on the suffix tree data structure.

This chapter discusses the use of a suffix tree based algorithm for finding repetitions and complementary reversals (biological palindromes) in a nucleotide sequence. In order to start with the general case without wobble pairing, the algorithms are developed for a DNA sequence, in which only Watson-Crick base pairing matters. An ANSI C implementation [9] of this algorithm is programmed and its performance is tested at the end of this chapter. Since the suffix tree is a relatively unknown data structure, its mechanics are explained here, going into a little more detail on the points that are usually insufficiently illustrated.

3.1 The Suffix Tree

If x_1 x_2 ... x_i ... x_n is a sequence, x_i x_{i+1} ... x_n is called a suffix and x_1 x_2 ... x_j a prefix of that sequence, for all i, j = 1 to n. For example, for the sequence MISSISSIPPI, MISS is a prefix of that sequence and IPPI a suffix, and ISSISS is a suffix of the prefix MISSISS.

In order to define the suffix tree, it is necessary to introduce the concept of a suffix trie.
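The definitions of prefix and suffix, and the cost of a brute force repetition search, can be illustrated with a short Python sketch. It is not the ANSI C program described in this chapter; the function names are hypothetical and chosen for illustration only. The brute force search compares the text starting at every pair of positions, which gives the cubic worst-case behaviour mentioned above.

```python
def suffixes(s):
    """All non-empty suffixes x_i ... x_n of s, longest first."""
    return [s[i:] for i in range(len(s))]


def prefixes(s):
    """All non-empty prefixes x_1 ... x_j of s, shortest first."""
    return [s[:j] for j in range(1, len(s) + 1)]


def longest_repeat_brute_force(s):
    """Longest substring that occurs at least twice in s (occurrences may overlap).

    For every pair of start positions (about N^2 pairs) the common text is
    compared character by character (up to N steps), hence O(N^3) in the worst case.
    """
    n = len(s)
    best = ""
    for i in range(n):
        for j in range(i + 1, n):
            k = 0
            while j + k < n and s[i + k] == s[j + k]:
                k += 1
            if k > len(best):
                best = s[i:i + k]
    return best


word = "MISSISSIPPI"
print("MISS" in prefixes(word))            # True
print("IPPI" in suffixes(word))            # True
print("ISSISS" in suffixes("MISSISS"))     # True
print(longest_repeat_brute_force(word))    # ISSI
```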

3.1.1 Suffix Trie

Each sequence of characters can be represented by a trie, a kind of tree that contains every suffix of the sequence. Consider the sequence BOOK, with suffixes BOOK, OOK, OK and K. Figure 3.1 shows how the suffix trie is constructed. At the start there is only an empty trie, which contains only node 0 (fig. 3.1(a)). Then, beginning at node 0, the longest suffix is added, one edge and one node per character of the suffix. Each edge is labelled by one character (fig. 3.1(b)). The following suffix, OOK, is also added at node 0, but since it starts with a different character, it defines a new edge, resulting in fig. 3.1(c). The next suffix to be added, OK, has the same start character as a suffix that is already represented, starting with an edge labelled O. So this edge is followed to the next node, where a check is performed to see whether it has an edge starting with K. Because this is not yet the case, a new edge labelled K is created (fig. 3.1(d)). Adding the last suffix, K, is easy, and the result is the suffix trie (fig. 3.1(e)). Thanks to this way of constructing, there will never be a node from which two edges with the same character leave. In this suffix trie, every suffix of the sequence can be found by starting at node 0 and walking down the tree.

Figure 3.1: Construction of the suffix trie for the sequence BOOK, in five steps (a) to (e); the last panel shows the complete suffix trie.

The most important characteristic of the suffix trie is that one can search for any subsequence of the word by starting at node 0 and following the matches down. If at any moment the correctly matching edge is not present or the tree just stops, the search stops, and this means that the subsequence is not part of the sequence. The strength of this method is the speed of subsequence searches.
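The construction and search described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this project: the trie is represented as nested dictionaries (one dictionary per node, one key per outgoing edge), and the function names are hypothetical.

```python
def build_suffix_trie(sequence):
    """Insert every suffix of the sequence, one edge and one node per character."""
    root = {}  # node 0
    for start in range(len(sequence)):
        node = root
        for ch in sequence[start:]:
            # Follow the edge labelled ch if it exists, otherwise create it.
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Walk down from node 0; the search fails as soon as an edge is missing."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_nodes(trie):
    """Number of nodes in the trie, including node 0."""
    return 1 + sum(count_nodes(child) for child in trie.values())


trie = build_suffix_trie("BOOK")
print(contains(trie, "OOK"))  # True
print(contains(trie, "OK"))   # True
print(contains(trie, "BK"))   # False
print(count_nodes(trie))      # 10
```

A search for a pattern of length m needs at most m character comparisons, regardless of the length of the sequence, while the nested loops of the construction reflect its quadratic cost.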
If the collected works of Shakespeare are written as a string sequence and its suffix trie is constructed, determining if the word BOOK is part of it can be determined by only performing four character comparisons. Although searching in suffix trees can be done very fast, it might be clear by now that constructing a suffix tree is a task that will require lots of time (and space). Concretely, 11

22 it requires O(N 2 ) time and space, where N is the length of the sequence. This quadratic performance makes it impossible to deal with large subsequences. A method to deal with these problems is using a suffix tree, based on the suffix trie Suffix Tree Definition Suffix trees are compressed tries, which contain all suffixes of a given string sequence. To get a suffix tree, path compression is applied, a method proposed by Edward McCreight in 1976: nodes from which only one edge leaves, are eliminated, so that individual edges in the tree may now represent sequences of text instead of individual characters B O K BOOK O K 1,4 2,2 4,4 O O K OK K 3,4 4,4 O K K BOOK 1234 BOOK 1234 BOOK 1234 suffix trie suffix tree suffix tree (with indices on edges) Figure 3.2: Constructing the suffix tree out of the suffix trie. Figure 3.2 shows how the suffix tree of the sequence BOOK can be obtained from its suffix trie in an intuitive way. The first drawing represents the suffix trie for BOOK, in which all distinct suffixes from the sequence can be found, starting from node 0, and every character in these suffixes labels one edge between the nodes. In the second part, path compression is applied, eliminating nodes with only a single leaving edge. The edges can now represent substrings of characters. All information of the suffix trie is conserved, but less memory is used, due to the eliminating of nodes. The third part of figure 3.2 shows the suffix tree as it is used concretely: the substrings labelling the edges have been replaced by the corresponding start and end indices in the sequence. McCreight s path compression led to the suffix tree data-structure, eliminating a large number of nodes so that time and space complexity are reduced to O(N). This makes the suffix tree a very reasonable structure for sequence problems, requiring only a one-time pre-processing investment. 
McCreight's first algorithm to construct suffix trees had one important disadvantage: the tree had to be built in reverse order, starting with the last characters of the sequence. This ruled out on-line processing, in which the tree is constructed while new characters are still being received at the end of the string.
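The relation between the suffix trie and the suffix tree can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the linear-time construction discussed in this chapter: the trie is built naively (quadratic in the sequence length), nodes are plain nested dictionaries, and the function names `build_suffix_trie`, `contains`, `compress` and `count_nodes` are chosen here for the example. Edges are labelled with substrings rather than with index pairs.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts: one edge (dict key) per character."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, word):
    """Follow one edge per character: at most len(word) comparisons."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return True


def compress(node):
    """Path compression: merge every chain of single-edge nodes into one edge."""
    tree = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:
            (next_ch, grandchild), = child.items()
            label += next_ch
            child = grandchild
        tree[label] = compress(child)
    return tree


def count_nodes(node):
    """Count a node together with all nodes below it."""
    return 1 + sum(count_nodes(child) for child in node.values())


trie = build_suffix_trie("BOOK")
tree = compress(trie)
print(contains(trie, "OOK"), contains(trie, "KO"))  # True False
print(count_nodes(trie), count_nodes(tree))         # 10 6
print(tree)  # {'BOOK': {}, 'O': {'OK': {}, 'K': {}}, 'K': {}}
```

For BOOK the compressed result has the same edges as the middle drawing of figure 3.2. Replacing each label by its start and end index in the sequence would give the representation of the third drawing.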

3.1.3 Mechanics

In 1995, Esko Ukkonen proposed an efficient algorithm that allows on-line processing (see [24], or the summarized description in [13]). His algorithm starts with an empty tree (node 0) and then progressively adds each of the N prefixes of the string sequence to the suffix tree. If an extra character is appended to the sequence, the tree can be updated by adding the next prefix. The process in which one prefix is added to the suffix tree is called a phase. In each phase, every suffix of the current prefix is added to the tree, starting with the longest suffix and working down to the shortest one, which is the empty string. This way every substring of the sequence ends up in the tree: an arbitrary substring $x_i \ldots x_j$ ($i \le j$) of the sequence is dealt with when the suffix $x_i \ldots x_j$ of the prefix $x_1 \ldots x_j$ is added.

[Figure 3.3: The four phases in the suffix tree construction for the sequence BOO. Phase 1: empty string; phase 2: B; phase 3: BO; phase 4: BOO.]

In figure 3.3, an example for the sequence BOO is worked out. It shows the four phases of the construction of its suffix tree. Note that these could also be the first four phases in constructing the suffix tree of a longer sequence, e.g. BOOKKEEPER. In the first phase, the prefix to add to the tree is the empty string, resulting in a tree that contains only node 0. The first non-empty prefix to add to the tree is B, in phase 2. Adding prefix B means adding all its suffixes, starting with the longest, B, and ending with the shortest, the empty string. To add suffix B, a new edge, labelled B, and a new node, labelled with its number, 1, are created. To add the empty string to the tree, nothing is changed, because it is already contained. In phase 3, prefix BO is added, which consists of adding the suffixes BO, O and the empty string.
This is done in the same way as in the suffix trie, but now path compression is applied, resulting in a single edge representing BO. For O, a new edge is created, because no edge leaving node 0 starts with O yet. Adding prefix BOO in phase 4 is done in exactly the same way.
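The phase structure described above can be written down directly. The following Python sketch only enumerates which suffixes are handled in which phase; it does not build a tree, and the helper names `phases` and `all_substrings` are introduced here purely for illustration.

```python
def phases(text):
    """Yield (phase number, prefix, suffixes of that prefix, longest first)."""
    for j in range(len(text) + 1):
        prefix = text[:j]
        suffixes = [prefix[i:] for i in range(j + 1)]  # the last one is ""
        yield j + 1, prefix, suffixes


def all_substrings(text):
    """Every substring of text, including the empty string."""
    n = len(text)
    return {text[i:j] for i in range(n + 1) for j in range(i, n + 1)}


for number, prefix, suffixes in phases("BOO"):
    print(number, repr(prefix), suffixes)
# 1 '' ['']
# 2 'B' ['B', '']
# 3 'BO' ['BO', 'O', '']
# 4 'BOO' ['BOO', 'OO', 'O', '']

# Every substring of the sequence is a suffix of some prefix.
handled = {s for _, _, suffixes in phases("BOO") for s in suffixes}
print(handled == all_substrings("BOO"))  # True
```

The final check mirrors the argument in the text: because every substring is a suffix of some prefix, handling all suffixes in every phase is enough to get every substring into the tree.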

Active Point

Until now, updating the tree was easy. The only types of update were the creation of a new edge from node 0 and the extension of existing edges. Suppose now that the suffix tree for the sequence BOOKKEEPER is to be constructed. While the tree already constructed for BOO is updated with the remaining prefixes of BOOKKEEPER, a third type of update is needed. An example of this update is illustrated in figure 3.4. At one point in the construction, the tree of figure 3.4 is reached, representing the tree for BOOKK. First of all, one can see that this tree contains every possible suffix of every prefix of BOOKK, starting at node 0. Each suffix ends at a node of one of these three types:

Leaf node: a node from which no edges leave, e.g. node 2.
Explicit node: a non-leaf node at a point in the tree where two or more edges part ways, e.g. node 3.
Implicit node: a position in the middle of an edge. Implicit nodes appeared as nodes in the suffix trie, but because of path compression they are not nodes of the suffix tree, e.g. the position in the middle of KK, between node 0 and node 5.

[Figure 3.4: Suffix tree for BOOKK (phase 6).]

Updating the tree means visiting each of the suffixes in the existing tree and adding the next character of the sequence to the end of the suffix. This can be done in three ways:

1. Adding a new edge with a new node to node 0.
2. Simple extension of an edge.
3. Creation of a new node by turning an implicit node into an explicit node and adding an edge with a leaf node attached.
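The three node types can be checked by brute force, without building the tree at all. The Python sketch below is an assumption-laden illustration rather than part of the algorithm: the function name `node_type` is made up for this example, and it decides the type of the position reached by a string purely from the characters that follow its occurrences in the sequence.

```python
def node_type(text, s):
    """Classify the position that string s reaches in the suffix tree of text."""
    starts = [i for i in range(len(text) - len(s) + 1) if text.startswith(s, i)]
    if not starts:
        return "absent"
    if s == "":
        return "explicit"  # the empty string corresponds to node 0
    # Characters following s at every occurrence that is not at the very end.
    following = {text[i + len(s)] for i in starts if i + len(s) < len(text)}
    if not following:
        return "leaf"      # s occurs only once, at the end of the sequence
    if len(following) >= 2:
        return "explicit"  # two or more edges part ways after s
    return "implicit"      # s ends in the middle of an edge


text = "BOOKK"
for i in range(len(text) + 1):
    print(repr(text[i:]), node_type(text, text[i:]))
print(repr("O"), node_type(text, "O"))
```

For BOOKK this reports the four longest suffixes as leaves, the suffix K as implicit and the empty string as explicit. The string O, which is not a suffix of BOOKK, is reported as explicit, in line with node 3 in figure 3.4.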

The third type of extension happens, for example, when the tree for the prefix BOOKK is updated with a new prefix, BOOKKE. The suffix K of BOOKK ends in an implicit node partway down the edge labelled KK, between node 0 and node 5 (see fig. 3.4). When updating a suffix tree, the active point is defined as the position in the prefix where the first suffix starts that does not terminate at a leaf node. In this case the active point can be represented as the vertical line in BOOK|K, because the suffix K is the first one encountered in BOOKK that does not end at a leaf node. This active point then defines the suffix K, and its length is said to be the length of the suffix starting at that point (this also explains what is meant further on in this text by suffixes that are longer or shorter than the active point). It corresponds to the implicit node in the middle of the KK branch in the suffix tree.

[Figure 3.5: Suffix trees for BOOKK and BOOKKE. Panel (a): phase 6, BOOKK; panel (b): phase 7, BOOKKE.]

The prefix to add to the tree is BOOKKE. Updating the tree means visiting each of the suffixes in the existing tree and adding the next character, E, to the end of the suffix (see fig. 3.5(a)). At this point there are two kinds of suffixes. First are the ones that ended in a leaf node when the previous prefix was dealt with. Updating these suffixes is easy: because of path compression, it suffices to append E to the string on the edge that ends in that leaf. These are the suffixes that are dealt with first. Then a suffix that did not end in a leaf node, K, is reached. This one starts at the active point. To update it, and all the shorter suffixes (in this case only the empty string), the node it ended in is located, followed by a check whether an edge starting with the character to add already leaves that node. It turns out that K ended in an implicit node.
Since none of the edges leaving that node begins with the character E (there is only one edge, which starts with K), this node is converted into an explicit node, and an edge labelled E with a new leaf node is added (fig. 3.5(b)). The same process is repeated for all shorter suffixes, to make sure they are in the tree. In this case the only shorter suffix is the empty string, corresponding to node 0, and since there is no edge starting with E leaving this node yet, a new node is added, labelled with its number.

Basics Of The Algorithm

The suffix tree has some characteristics that allow for a fairly efficient algorithm. The first important trait is this: once a leaf node, always a leaf node. Any node that is created as a leaf will never be given a descendant; its edge will only be extended through character concatenation. More importantly, every time a new suffix is added to the tree, the edges leading into every leaf node are automatically extended by a single character. That character is the last character of the new suffix. This makes management of the edges leading into leaf nodes easy. Any time a new leaf node is created, its edge is automatically set to represent all the characters from its starting point to the end of the input text. Even if those characters are still unknown, it is certain that they will be added to the tree eventually. Because of this, once a leaf node is created, it can simply be forgotten about. If the edge is split later on, its starting point may change, but it will still extend all the way to the end of the input text.

This means that the only necessary updates concern the explicit and implicit nodes from the active point onwards (the active point defines the first suffix that did not end in a leaf node). Given this, only the suffixes from the active point down to the empty string would have to be considered, testing each node for update eligibility. However, some time can be saved by stopping the update earlier. When walking through the suffixes, a new edge is added to each node that does not have a descendant edge starting with the correct character. When a suffix is finally reached that corresponds to a node that already has the correct character as a descendant, the update can stop, because all shorter suffixes were updated in the same way in a previous phase.
The obvious conclusion is that, if a certain character is found as a descendant of a particular suffix, it is bound to be a descendant of every shorter suffix as well.

End Point

When adding a new prefix to the tree, the end point is the position where the first matching descendant is found, i.e. the start of the first suffix whose extension with the new character is already present in the tree. Every suffix equal to or shorter than the one starting at this point is already contained in the tree, meaning that these suffixes mark repetitions in the sequence. This is a useful fact for section 3.2, where the algorithm will have to look for repetitions. The end point has an important extra feature that makes it particularly useful. Since leaf nodes are added for every suffix between the active point and the end point, every suffix longer than the end point will end in a leaf node after the update. This means that the end point will turn into the active point on the next pass over the tree.
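The claim that the end point of one phase becomes the active point of the next can be checked with a small brute-force experiment. The Python sketch below does not build a suffix tree. It only applies the two definitions literally to plain strings, and the names `active_point`, `end_point` and `mark` are assumptions made for this illustration. Positions are counted as the number of characters in front of the point.

```python
def active_point(prefix):
    """Start of the longest suffix of prefix that also occurs earlier in prefix.

    Such a suffix does not end in a leaf node; all longer suffixes do.
    """
    for i in range(len(prefix) + 1):
        if prefix.find(prefix[i:]) < i:
            return i
    return len(prefix)  # only reached for the empty prefix


def end_point(prefix, new_char):
    """Start of the first suffix, from the active point on, whose extension
    with new_char is already contained in the tree of prefix."""
    for i in range(active_point(prefix), len(prefix) + 1):
        if prefix[i:] + new_char in prefix:
            return i
    return len(prefix) + 1  # new_char is new: even the empty suffix gets a leaf


def mark(prefix, i):
    """Show a position as a vertical line, as in BOOK|K."""
    return prefix[:i] + "|" + prefix[i:]


text = "BOOKKEEPER"
for j in range(len(text)):
    prefix, ch = text[:j], text[j]
    start, end = active_point(prefix), end_point(prefix, ch)
    # The end point of this phase is the active point of the next phase.
    assert end == active_point(prefix + ch)
    print(mark(prefix, start), "+", ch, "->", mark(prefix + ch, end),
          "new leaves:", end - start)
```

For the step from BOOKK to BOOKKE this gives the active point BOOK|K and the end point BOOKKE|, with two new leaves, which matches the two leaf edges labelled E that are created in that phase.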

Example: Active and End Point

With the introduction of the end point, the prefix can now be divided in three zones (see fig. 3.6).

[Figure 3.6: Active and end point divide the prefix in three zones. The figure shows the sequence ACTGATTGGCTGGCTGGCTGA with the active point and the end point marked.]

Suppose the suffix tree for the sequence ACTGATTGGCTGGCTGGCTGA (a quasi-random sequence that contains a few repetitions) is being constructed, and the tree corresponding to the prefix ACTGATTGGCTGGCTGGCTG is already done. In that case the active point and the end point are both found at the same location, as follows: ACTGATTGGC|TGGCTGGCTG. If the following prefix is now added to the tree, the current active point is set to the previous end point (in this case its position stays the same), and the current end point is set to ACTGATTGGCTGGCTGG|CTGA, since the first suffix that is already contained in the tree is CTGA. This divides the prefix in three zones:

zone 1: Start characters for suffixes that end in leaf nodes and only need their last edge extended with the new character A.
zone 2: Start characters for suffixes that are not yet contained in the tree and will therefore cause the creation of new edges and leaf nodes.
zone 3: Start characters for suffixes that are already contained in the tree. These suffixes end at non-leaf nodes and need no update. All suffixes starting in zone 3 mark repetitions in the sequence.

By confining the updates to the suffixes of zone 2, far less work is required to update the tree. And by keeping track of the end point, the position of the active point in the following update is automatically known.

Suffix Pointer

When navigating through the tree, one operation that has to be implemented efficiently is finding the node corresponding to the next shorter suffix. If this is done simply by walking down the tree from node 0 until the correct node is found, the algorithm is not going to run in linear time. To get around this, a so-called suffix pointer (see fig. 3.7) is introduced.
This is a pointer found at each internal node, which points to the node that is the first suffix of that string¹. So if a particular node represents a string containing characters 1 through N of the input text, the suffix pointer for that node will point to the node that is the termination point for the string starting at the root that represents characters 2 through N of the input text.

[Figure 3.7: The suffix pointers in the suffix tree for the string ABABABC, represented as arrows. Each suffix pointer starts at an explicit node and points from one suffix to the next one. The suffix pointer at node 4, for example, points from suffix ABAB to suffix BAB.]

The suffix pointers are built at the same time as the update to the tree takes place. When moving from the active point to the end point, the father node of each of the newly created leaves is remembered. Every creation of a new edge goes together with the creation of a suffix pointer from the father node of the last created leaf edge to the current father node. Obviously, this cannot be done for the first edge created in the update (since no previous leaf edges were added in the current phase), so for this one walking through the tree is still required.

¹ In practice, only the suffix pointers for explicit nodes are kept track of. For implicit nodes, suffix pointer lookup happens by tracing the father node's suffix pointer. The correct suffix is then found by walking down the correct edge, starting at the node that the suffix pointer points to. For example, determining the suffix pointer for BABA in figure 3.7 consists of tracing its father node, 6, following the suffix pointer that starts there, to node 1, and walking down the correct edge (towards node 4) until ABA is encountered.

Source code

Since the code for string searches in sequences using Ukkonen's algorithm is available as open source on the internet, there was no need to rewrite this part. The program presented here to look for repetitions and complementary reversals is based on the ANSI C implementation of the suffix tree written by Dotan Tsadok for his undergraduate project at Haifa University, Israel, in August. In this program, the Ukkonen algorithm is implemented as described in [7]. Updating the tree with one prefix (one phase) is done by running the function SPA, the single phase algorithm. For this prefix, every suffix is added to the tree with the function SEA, the single extension algorithm. In pseudocode, this looks like:

    SEA(current_suffix) {
        /* the new suffix is current_suffix extended with the new character */
        test_char = last_char in new_suffix;
        follow current node's suffix link;
        if (suffix link ends at an explicit node) {
            if (the node has no descendant edge starting with test_char)
                create new leaf edge starting at the explicit node;
            else
                current phase is done;
        } else {
            if (the implicit node's next char isn't test_char) {
                split the edge at the implicit node;
                create new leaf edge starting at the split in the edge;
            } else
                current phase is done;
        }
    }

Whenever a new node is created, the function SEA is called again for the next suffix. When the first suffix is reached that is already contained in the tree, the end point is set there, and the next prefix is dealt with (next phase).

3.2 Algorithm for Finding Repetitions

Suppose now that repetitions in the sequence are important, and the algorithm is to be extended to look for them. Constructing the suffix tree first would discard the information about repetitions, since a repetition in the sequence is a suffix that is equal to or shorter than the one at the end point, and these suffixes are ignored because they are already part of the tree. To detect repetitions, the algorithm is changed so that, in every phase, it looks at what happens behind the end point. Making this extension time-efficient is not obvious, because the strength of the Ukkonen algorithm lies precisely in ignoring the suffixes behind the end point. Visiting the suffixes is done by following suffix pointers. The point at which a matching descendant is found is set as the end point. The suffix at the current end point is stored as a repetition, and the earlier occurrence it is equal to can be found as the string that starts at node 0 and ends in the node corresponding to the end point. All suffixes shorter than the one at the end point are also repetitions. These suffixes can be found by following the remaining suffix pointers, and they are equal to a previous suffix that starts at node 0 and ends in

Figure 1. The Suffix Trie Representing "BANANAS".

Figure 1. The Suffix Trie Representing BANANAS. The problem Fast String Searching With Suffix Trees: Tutorial by Mark Nelson http://marknelson.us/1996/08/01/suffix-trees/ Matching string sequences is a problem that computer programmers face on a regular

More information

: Intro Programming for Scientists and Engineers Assignment 3: Molecular Biology

: Intro Programming for Scientists and Engineers Assignment 3: Molecular Biology Assignment 3: Molecular Biology Page 1 600.112: Intro Programming for Scientists and Engineers Assignment 3: Molecular Biology Peter H. Fröhlich phf@cs.jhu.edu Joanne Selinski joanne@cs.jhu.edu Due Dates:

More information

BMI/CS Lecture #22 - Stochastic Context Free Grammars for RNA Structure Modeling. Colin Dewey (adapted from slides by Mark Craven)

BMI/CS Lecture #22 - Stochastic Context Free Grammars for RNA Structure Modeling. Colin Dewey (adapted from slides by Mark Craven) BMI/CS Lecture #22 - Stochastic Context Free Grammars for RNA Structure Modeling Colin Dewey (adapted from slides by Mark Craven) 2007.04.12 1 Modeling RNA with Stochastic Context Free Grammars consider

More information

11/5/09 Comp 590/Comp Fall

11/5/09 Comp 590/Comp Fall 11/5/09 Comp 590/Comp 790-90 Fall 2009 1 Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary secrets Many tumors

More information

11/5/13 Comp 555 Fall

11/5/13 Comp 555 Fall 11/5/13 Comp 555 Fall 2013 1 Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Phenotypes arise from copy-number variations Genomic rearrangements are often associated with repeats Trace

More information

Lecture 5: Suffix Trees

Lecture 5: Suffix Trees Longest Common Substring Problem Lecture 5: Suffix Trees Given a text T = GGAGCTTAGAACT and a string P = ATTCGCTTAGCCTA, how do we find the longest common substring between them? Here the longest common

More information

Visualization of Secondary RNA Structure Prediction Algorithms

Visualization of Secondary RNA Structure Prediction Algorithms San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research 2006 Visualization of Secondary RNA Structure Prediction Algorithms Brandon Hunter San Jose State University

More information

INTRODUCTION TO BIOINFORMATICS

INTRODUCTION TO BIOINFORMATICS Molecular Biology-2019 1 INTRODUCTION TO BIOINFORMATICS In this section, we want to provide a simple introduction to using the web site of the National Center for Biotechnology Information NCBI) to obtain

More information

INTRODUCTION TO BIOINFORMATICS

INTRODUCTION TO BIOINFORMATICS Molecular Biology-2017 1 INTRODUCTION TO BIOINFORMATICS In this section, we want to provide a simple introduction to using the web site of the National Center for Biotechnology Information NCBI) to obtain

More information

Long Read RNA-seq Mapper

Long Read RNA-seq Mapper UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGENEERING AND COMPUTING MASTER THESIS no. 1005 Long Read RNA-seq Mapper Josip Marić Zagreb, February 2015. Table of Contents 1. Introduction... 1 2. RNA Sequencing...

More information

Dynamic Programming (cont d) CS 466 Saurabh Sinha

Dynamic Programming (cont d) CS 466 Saurabh Sinha Dynamic Programming (cont d) CS 466 Saurabh Sinha Spliced Alignment Begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the

More information

Sept. 9, An Introduction to Bioinformatics. Special Topics BSC5936:

Sept. 9, An Introduction to Bioinformatics. Special Topics BSC5936: Special Topics BSC5936: An Introduction to Bioinformatics. Florida State University The Department of Biological Science www.bio.fsu.edu Sept. 9, 2003 The Dot Matrix Method Steven M. Thompson Florida State

More information

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment An Introduction to NCBI BLAST Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment Resources: The BLAST web server is available at https://blast.ncbi.nlm.nih.gov/blast.cgi

More information

The Dot Matrix Method

The Dot Matrix Method Special Topics BS5936: An Introduction to Bioinformatics. Florida State niversity The Department of Biological Science www.bio.fsu.edu Sept. 9, 2003 The Dot Matrix Method Steven M. Thompson Florida State

More information

Determining gapped palindrome density in RNA using suffix arrays

Determining gapped palindrome density in RNA using suffix arrays Determining gapped palindrome density in RNA using suffix arrays Sjoerd J. Henstra Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Abstract DNA and RNA strings contain

More information

Dynamic Programming Course: A structure based flexible search method for motifs in RNA. By: Veksler, I., Ziv-Ukelson, M., Barash, D.

Dynamic Programming Course: A structure based flexible search method for motifs in RNA. By: Veksler, I., Ziv-Ukelson, M., Barash, D. Dynamic Programming Course: A structure based flexible search method for motifs in RNA By: Veksler, I., Ziv-Ukelson, M., Barash, D., Kedem, K Outline Background Motivation RNA s structure representations

More information

15.4 Longest common subsequence

15.4 Longest common subsequence 15.4 Longest common subsequence Biological applications often need to compare the DNA of two (or more) different organisms A strand of DNA consists of a string of molecules called bases, where the possible

More information

Biostatistics and Bioinformatics Molecular Sequence Databases

Biostatistics and Bioinformatics Molecular Sequence Databases . 1 Description of Module Subject Name Paper Name Module Name/Title 13 03 Dr. Vijaya Khader Dr. MC Varadaraj 2 1. Objectives: In the present module, the students will learn about 1. Encoding linear sequences

More information

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah

More information

DNA Inspired Bi-directional Lempel-Ziv-like Compression Algorithms

DNA Inspired Bi-directional Lempel-Ziv-like Compression Algorithms DNA Inspired Bi-directional Lempel-Ziv-like Compression Algorithms Attiya Mahmood, Nazia Islam, Dawit Nigatu, and Werner Henkel Jacobs University Bremen Electrical Engineering and Computer Science Bremen,

More information

15.4 Longest common subsequence

15.4 Longest common subsequence 15.4 Longest common subsequence Biological applications often need to compare the DNA of two (or more) different organisms A strand of DNA consists of a string of molecules called bases, where the possible

More information

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive

More information

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein

More information

This book is licensed under a Creative Commons Attribution 3.0 License

This book is licensed under a Creative Commons Attribution 3.0 License 6. Syntax Learning objectives: syntax and semantics syntax diagrams and EBNF describe context-free grammars terminal and nonterminal symbols productions definition of EBNF by itself parse tree grammars

More information

Bioinformatics explained: BLAST. March 8, 2007

Bioinformatics explained: BLAST. March 8, 2007 Bioinformatics Explained Bioinformatics explained: BLAST March 8, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com Bioinformatics

More information

Motif Discovery using optimized Suffix Tries

Motif Discovery using optimized Suffix Tries Motif Discovery using optimized Suffix Tries Sergio Prado Promoter: Prof. dr. ir. Jan Fostier Supervisor: ir. Dieter De Witte Faculty of Engineering and Architecture Department of Information Technology

More information

Data Mining Technologies for Bioinformatics Sequences

Data Mining Technologies for Bioinformatics Sequences Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment

More information

Acceleration of the Smith-Waterman algorithm for DNA sequence alignment using an FPGA platform

Acceleration of the Smith-Waterman algorithm for DNA sequence alignment using an FPGA platform Acceleration of the Smith-Waterman algorithm for DNA sequence alignment using an FPGA platform Barry Strengholt Matthijs Brobbel Delft University of Technology Faculty of Electrical Engineering, Mathematics

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 6: Alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg

More information

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT - Swarbhanu Chatterjee. Hidden Markov models are a sophisticated and flexible statistical tool for the study of protein models. Using HMMs to analyze proteins

More information

2.2 Syntax Definition

2.2 Syntax Definition 42 CHAPTER 2. A SIMPLE SYNTAX-DIRECTED TRANSLATOR sequence of "three-address" instructions; a more complete example appears in Fig. 2.2. This form of intermediate code takes its name from instructions

More information

Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships

Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships Abhishek Majumdar, Peter Z. Revesz Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln,

More information

A Comparative Study of Linear Encoding in Genetic Programming

A Comparative Study of Linear Encoding in Genetic Programming 2011 Ninth International Conference on ICT and Knowledge A Comparative Study of Linear Encoding in Genetic Programming Yuttana Suttasupa, Suppat Rungraungsilp, Suwat Pinyopan, Pravit Wungchusunti, Prabhas

More information

Keywords Pattern Matching Algorithms, Pattern Matching, DNA and Protein Sequences, comparison per character

Keywords Pattern Matching Algorithms, Pattern Matching, DNA and Protein Sequences, comparison per character Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Index Based Multiple

More information

8/19/13. Computational problems. Introduction to Algorithm

8/19/13. Computational problems. Introduction to Algorithm I519, Introduction to Introduction to Algorithm Yuzhen Ye (yye@indiana.edu) School of Informatics and Computing, IUB Computational problems A computational problem specifies an input-output relationship

More information

Lecture 7 February 26, 2010

Lecture 7 February 26, 2010 6.85: Advanced Data Structures Spring Prof. Andre Schulz Lecture 7 February 6, Scribe: Mark Chen Overview In this lecture, we consider the string matching problem - finding all places in a text where some

More information

On the Efficacy of Haskell for High Performance Computational Biology

On the Efficacy of Haskell for High Performance Computational Biology On the Efficacy of Haskell for High Performance Computational Biology Jacqueline Addesa Academic Advisors: Jeremy Archuleta, Wu chun Feng 1. Problem and Motivation Biologists can leverage the power of

More information

Properties of Biological Networks

Properties of Biological Networks Properties of Biological Networks presented by: Ola Hamud June 12, 2013 Supervisor: Prof. Ron Pinter Based on: NETWORK BIOLOGY: UNDERSTANDING THE CELL S FUNCTIONAL ORGANIZATION By Albert-László Barabási

More information

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi. Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data

More information

2) NCBI BLAST tutorial This is a users guide written by the education department at NCBI.

2) NCBI BLAST tutorial   This is a users guide written by the education department at NCBI. Web resources -- Tour. page 1 of 8 This is a guided tour. Any homework is separate. In fact, this exercise is used for multiple classes and is publicly available to everyone. The entire tour will take

More information

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations: Lecture Overview Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating

More information

BIOINFORMATICS A PRACTICAL GUIDE TO THE ANALYSIS OF GENES AND PROTEINS

BIOINFORMATICS A PRACTICAL GUIDE TO THE ANALYSIS OF GENES AND PROTEINS BIOINFORMATICS A PRACTICAL GUIDE TO THE ANALYSIS OF GENES AND PROTEINS EDITED BY Genome Technology Branch National Human Genome Research Institute National Institutes of Health Bethesda, Maryland B. F.

More information
