New Implementation for the Multi-sequence All-Against-All Substring Matching Problem
Oana Sandu
Supervised by Ulrike Stege
In collaboration with Chris Upton, Alex Thomo, and Marina Barsky
University of Victoria, Department of Computer Science
September 19, 2007

Abstract

The threshold all-against-all problem in approximate string matching has important applications in bioinformatics. For two input sequences, an algorithm has been developed which depends linearly on the size of each of the sequences. However, this algorithm doesn't extend efficiently to more than two sequences in a straightforward manner. After modifying the problem somewhat, we have developed a fast algorithm which reports patterns unusually conserved between any number of input sequences, and also returns all of their approximate matches.
Table of Contents

1 Introduction
2 Previous work and possible improvements
  2.1 Algorithm for the threshold all-against-all problem
      Description
      Possible improvements
  2.2 The multiple-sequence version of the all-against-all problem
      Description
      Possible improvements
3 A different implementation for multiple-sequence all-against-all matching
  3.1 The problem redefined
  3.2 Using pattern against text matching as a subroutine
  3.3 Returning candidates and their matches in multiple-sequence comparison
  3.4 Left to complete
4 Additional features desired in the program
5 Conclusion
References

1 Introduction

Marina Barsky, a Ph.D. candidate at the University of Victoria, developed an algorithm to solve the threshold all-against-all problem [1]. The problem is: given two strings of lengths M and N, find all maximal pairs of substrings of size at least S with at most K differences. She solves it using a method referred to as All Paths Below Threshold, or APBT. Her algorithm scales well with the sizes of the sequences and the number of differences.
This paper will present her implementation for solving the multiple-sequence version of the problem, also using APBT. Our main aim is to find a different approach to solving this problem, and to provide an implementation for it.

2 Previous work and possible improvements

2.1 Algorithm for the threshold all-against-all problem

Description

Previous approaches to solving the threshold all-against-all problem exactly were either based on dynamic programming or used suffix trees to avoid recomputing matches. The most efficient of these attempts is the one published by Baeza-Yates and Gonnet in [4]. Barsky's algorithm, developed in conjunction with the University of Victoria Computer Science professors Ulrike Stege and Alex Thomo, as well as Prof. Chris Upton (Biochemistry), aimed to improve on their worst-case running time of O(M^2 N^2) [1]. Her algorithm runs in O(M N K^3).

The problem solved by M. Barsky, which she calls "all error-bounded approximate matches", is: given two strings s and t, of lengths M and N, find all pairs of substrings (s[i,j], t[k,l]) such that the length of both substrings is greater than S and the edit distance between them is at most K. The substrings s[i,j] and t[k,l] are referred to as K-approximate matches. The matches reported are maximal, meaning that if (s[i,j], t[k,l]) is a solution, then there is no valid solution (s', t') with s' being a superstring of s[i,j] and t' = t[k,l], or vice versa.
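To make the definition above concrete, here is a minimal brute-force sketch in Python. It is not Barsky's APBT algorithm: it simply enumerates every pair of substrings and checks the edit distance directly, so it is only usable on very short strings. The function names, the 0-based half-open coordinates, and the choice of "length at least S" as the length condition are assumptions made for this illustration, and maximality is not enforced.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def approximate_pairs(s: str, t: str, min_len: int,
                      max_diff: int) -> List[Tuple[Tuple[int, int], Tuple[int, int]]]:
    """All pairs ((i, j), (k, l)) with s[i:j] and t[k:l] of length >= min_len
    and edit distance <= max_diff. Exhaustive; maximality is not enforced."""
    hits = []
    for i in range(len(s)):
        for j in range(i + min_len, len(s) + 1):
            for k in range(len(t)):
                for l in range(k + min_len, len(t) + 1):
                    if edit_distance(s[i:j], t[k:l]) <= max_diff:
                        hits.append(((i, j), (k, l)))
    return hits
```

The quadratic number of substrings in each string, multiplied by the cost of each distance computation, is what makes this direct approach impractical and motivates the algorithms discussed in this section.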
Barsky considers a matching matrix m which has s as one dimension and t as the second dimension, with m[i,j] = 1 iff s[i] = t[j]. The matrix m is used to induce a digraph G_m, with each vertex v_ij corresponding to a 1 entry in the matrix (m[i,j] = 1). G_m contains all edges (v_ij, v_kl) such that i < k and j < l. The graph is searched for paths, call them P(v_ij, v_kl), that have a match length greater than S and an error number of at most K. The match length of P is related to the difference between the coordinates of v_ij and v_kl, and is equal to min(k-i+1, l-j+1). The error number of P is the sum of the costs of all the edges it contains. The cost of an edge from v_ab to v_cd is max(c-a, d-b) - 1, corresponding to the minimum number of edit operations required to transform one of the substrings, say s[a,c], into the other, say t[b,d]. Here, an edit operation can be an insertion or deletion of one character, or a substitution between two characters. Hence, the error number of the path corresponds to the cost of an edit transcript between s[i,k] and t[j,l]. Barsky shows that the error number of the shortest path between any two vertices in G_m corresponds to the actual edit distance between the two corresponding substrings. This means that the problem of finding matches between strings s and t can be reduced to finding all maximal paths of match length at least S, with error number of at most K.

Possible improvements

To solve the all-paths-below-threshold problem, Barsky doesn't build the entire matrix m or the graph G_m at once, but computes values from them as needed. Also, information already computed about paths is used to determine whether to consider new paths. For more details on this, and for a proof that the algorithm runs in O(M N K^3), see section 4 of Barsky's paper [1]. Barsky suggested to me that the implementation could be sped up if it were run in parallel on several processors.
This can be done because sets of adjacent rows in the matrix can be processed independently. Hence, given s_1, s_2 and a chunk size of e.g. columns, a central processor could have auxiliary processors run APBT on s_2 and a chunk of s_1, and then combine the results from the different parts of the matrix into the complete solution for s_1 and s_2. According to Barsky, for highly similar
sequences, APBT can take as much as 20 hours to return a result, and parallelizing the algorithm could reduce the running time by a significant factor [2].

2.2 The multiple-sequence version of the all-against-all problem

Description

Chris Upton's virology lab is interested in predicting unusually conserved short regions of DNA or protein, given large input sequences. However, because of the nature of their research, they are more interested in regions that are unusually similar between a number of organisms, not just two; therefore, it is very important to provide them with an efficient program that takes several sequences as input [3]. The multiple-sequence version of the all-against-all problem can be defined as: given strings s_1, s_2, ..., s_m, find all matches (s'_1, s'_2, ..., s'_m), one substring from each string, such that the substrings are all of length S and the edit distance between any two is always at most K.

M. Barsky tried to use the existing implementation of APBT to get a method that would solve the multiple-sequence version of the problem exactly. It involves a filtering step in which the all-paths-below-threshold (APBT) algorithm is essentially run on all pairs of the given strings to get a set of potential start positions of matches in each of the strings. For each of the start positions from a certain chosen sequence, Barsky builds all the matches which include this start position as branches in a tree rooted at the original start position [2]. The algorithm breaks down to:

- run APBT on s_1, s_2
  o mark all possible match start positions in s_1 and s_2
- starting only at the marked start positions
  o run APBT on s_1, s_3
    - eliminate the candidate start positions from s_1 with no matches
  o run APBT on s_2, s_3
    - eliminate the candidate start positions from s_2 with no matches
  o continue doing this until all pairs of sequences have been compared;
  o then repeat the three previous steps on all pairs of sequences, eliminating more candidate start positions, until a run of APBT yields a less than 10-fold reduction in the number of candidates
- recover the candidate patterns
  o perform APBT only on the marked start positions in s_1, s_2
  o for each match reported by APBT from s_1, run APBT on it and s_2, then add the resulting matches as children of the match in a tree
    - run APBT on the match and s_3, and for each match, add it as a child of one of the matches added from s_2, iff their edit distance is less than K
    - repeat this for each sequence up to s_m.
  o for every tree, report all the distinct paths from root to leaf as solution tuples: (s'_1, s'_2, ..., s'_m)

Possible improvements

Marina encountered a problem when running her algorithm on 4 or more biologically related sequences. Namely, the memory required to build all of the trees is such that the intermediate information, even for one tree, can be larger than the amount of memory available in the average lab. The program is heavily slowed down by writing intermediate information to disk and then retrieving it [2]. Another issue is that the filter is very time consuming, since it applies APBT on the order of n^2 times for n sequences. APBT itself can also be quite time consuming for long, relatively similar sequences.

My project focuses mostly on finding a different approach to solving the multiple-sequence version of the threshold all-against-all problem. We aim to reduce the memory requirements, since they slow down the program. Another goal is to reduce the number of times APBT is run, since it can be very slow for similar sequences. The program can also likely be sped up by using faster ways of comparing candidates against sequences, as opposed to using APBT as a subroutine.
Much of the information stored in the trees described above is redundant: if, for instance, s_31 is present in both subtrees rooted at s_21 and s_22, and s_31 differs from both s_21 and s_22 at the same position, then the entire subtree rooted at s_31 could appear twice. Also, many of the candidates added to the tree at the sequence 3 level could be eliminated at a later point
in the algorithm (this happens if, e.g., s_31 has no match in sequence m). Hence, to save memory we could store potential matches and their edges in a graph until all the edges are computed, and then use the graph to recover the actual solution tuples. To build this new graph G, proceed as in the pseudocode for the all-against-all problem, only adding a substring r_ij at the i-th level in the graph if and only if it has edges to every single level from 1 to i-1. This is essentially building the trees above as a graph, but without storing the paths separately. The graph can then be processed to obtain all the valid solution tuples as induced complete subgraphs in G, i.e., sets of m vertices, one from each level of the graph, such that every vertex in the set has edges to every other vertex in the set.

3 A different implementation for multiple-sequence all-against-all matching

3.1 The problem redefined

C. Upton told us that he would like results from the multiple-sequence version in useful time, even if the results reported are a superset of the actual solution to the multiple all-against-all problem. What he specifically wants is a set of patterns, say taken from s_1, which are unusually conserved in all of the sequences. He doesn't necessarily need any two substrings in the set of matches to each pattern to be within K differences of each other. Hence, as long as the number of patterns is low enough, he can use the positions of their matches in all of the other sequences to figure out which are significant. He especially insisted that the current implementation is too slow to be usable by his lab, so he would like a fast algorithm, as long as no potential solutions are left out [3].

The idea behind the implementation done in the current project is to use APBT to get a set of candidate patterns. Then, we aim to quickly determine the candidates which have matches in all of the sequences and what these matches are.
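The idea in the preceding paragraph can be sketched as follows. This is only an illustration under stated assumptions: the candidate patterns are taken as given (in the project they come from APBT), and the matcher used here is the standard dynamic-programming approximate search (a substitute chosen for brevity, not the bit-parallel matcher the project uses). The names match_end_positions and filter_candidates are hypothetical.

```python
from typing import Dict, List


def match_end_positions(pattern: str, text: str, k: int) -> List[int]:
    """1-based end positions in text of substrings within edit distance k of
    pattern, using column-by-column dynamic programming where a match may
    start anywhere in the text."""
    m = len(pattern)
    col = list(range(m + 1))  # distances against the empty text prefix
    ends = []
    for j, ch in enumerate(text, start=1):
        new = [0]  # an occurrence may start at any text position
        for i in range(1, m + 1):
            new.append(min(col[i] + 1,
                           new[i - 1] + 1,
                           col[i - 1] + (pattern[i - 1] != ch)))
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends


def filter_candidates(candidates: List[str], sequences: List[str],
                      k: int) -> Dict[str, List[List[int]]]:
    """Keep only candidates with at least one match in every sequence.
    Maps each surviving candidate to its end positions, one list per sequence."""
    kept: Dict[str, List[List[int]]] = {}
    for cand in candidates:
        per_sequence = []
        for seq in sequences:
            ends = match_end_positions(cand, seq, k)
            if not ends:
                break  # no match in this sequence: drop the candidate
            per_sequence.append(ends)
        else:
            kept[cand] = per_sequence
    return kept
```

A candidate is abandoned as soon as one sequence yields no match, so patterns that are not conserved everywhere cost little beyond the first failing comparison.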
We examined how APBT performs with inputs that would be common in Upton's lab, so that we know the number
of candidate patterns that will have to be matched against the other sequences. The test sequences were Mycobacteriophage D29 and Mycobacteriophage Bxb1, taken from Barsky's paper [1], related viral genomes of around base pairs; several values of S and K were attempted, with K around 10% of S, as suggested by Upton.

Table 1: Tests of APBT on two viral sequences
Columns: S, K, number of solutions, processing time (min.)

The strategy for solving the multiple-sequence version is to use APBT on two sequences to get candidate patterns and then compare the candidates to the rest of the sequences using fast pattern-text matching. This can be used to efficiently report a set of matches for each candidate, with at least one match from each sequence. Since all the matches in any one set have an edit distance of at most K to the original pattern of the set, they are guaranteed (by the triangle inequality) to have at most 2K differences between each other. After this implementation, if the users require it, we can further process each set to produce sets of tuples, with one substring from each sequence. This will guarantee that any two substrings have an edit distance of at most K.

3.2 Using pattern against text matching as a subroutine

We based the pattern-text matching on the algorithm first described by Baeza-Yates and Gonnet in "A New Approach to Text Searching" [5]. This algorithm takes advantage of the small size of the pattern to use computer words, bit shifts and logical bitwise operations to determine where matches to the pattern end in the text. For the sizes we're interested in, the pattern can be encoded as several bit masks: there would be one bit mask for each
letter in the pattern, and a 1 bit would signify that the letter appears at that position. Note that we used Wu and Manber's extension of this algorithm [6] to also match the pattern to the text with up to k single-letter insertions, deletions, or substitutions.

The algorithm is given a pattern P, a text T, and values for the parameters k and S, and then determines all the end positions in T of matches to P with up to k errors. First, we compute a set of bit masks U, one for each character that can occur in the text. Then, we compute a set of matrices R_0, ..., R_k. Every column in R_d represents a different position in the text; a bit R_d(i, j) is 1 iff there is a match to the prefix P[1..i] ending at T(j), with d allowed differences. Column j of R_d depends on: column j-1 of R_d, the bit mask for character T(j), and some columns of R_{d-1}. The recurrence relation to compute column j of R_d is:

  R_d(0) = a column of d 1s
  R_d(j) = [ Bit-Shift(R_d(j-1)) AND U(T(j)) ]
           OR Bit-Shift(R_{d-1}(j-1))
           OR Bit-Shift(R_{d-1}(j))
           OR R_{d-1}(j-1)                      for j >= 1

Note that the final bit of column R_k(j) indicates whether or not there is a match to P ending at T(j) with up to k differences. Since computing a column R_d(j) involves looking up R_d(j-1), R_{d-1}(j) and R_{d-1}(j-1), and j could be of the order of tens of thousands, we don't store any entire R_d table. We consider instead the R_d(j) values to be cells in a table, with d as one dimension and j as the other. We compute values in the table diagonal by diagonal, storing a position j as a solution when appropriate. In this way, we only need to store the two previous diagonals when computing a new one, each of which holds at most k + 1 values. We also represent the actual values of R_d(j) as 64-bit longs; this limits the size of a pattern to search for to 64 characters, which is long enough for C. Upton's purposes. The maximum length can easily be changed to 128 characters in the future.
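The recurrence above can be sketched in Python as follows. This is a minimal illustration rather than the project's code: it processes the text column by column (keeping one value per error level) instead of diagonal by diagonal, and it uses Python's arbitrary-precision integers instead of 64-bit longs. The function name and the bit layout (the lowest bit stands for the prefix of length 1, and the shift feeds a 1 into that bit) are assumptions of this sketch.

```python
from typing import Dict, List


def shift_and_with_errors(pattern: str, text: str, k: int) -> List[int]:
    """Return the 1-based end positions in text of matches to pattern
    with at most k insertions, deletions, or substitutions."""
    m = len(pattern)
    full = (1 << m) - 1       # keeps only the m pattern bits
    accept = 1 << (m - 1)     # bit for the whole pattern P[1..m]

    # U[c]: bit i is set iff pattern[i] == c
    masks: Dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    # R[d] holds R_d(j) for the current j; R_d(0) has its lowest d bits set.
    R = [((1 << d) - 1) & full for d in range(k + 1)]
    ends = []
    for j, ch in enumerate(text, start=1):
        u = masks.get(ch, 0)
        prev_old = R[0]                           # R_0(j-1)
        R[0] = ((R[0] << 1) | 1) & u              # exact-match step
        for d in range(1, k + 1):
            old = R[d]                            # R_d(j-1)
            R[d] = ((((old << 1) | 1) & u)        # next character matches
                    | ((prev_old << 1) | 1)       # Bit-Shift(R_{d-1}(j-1))
                    | ((R[d - 1] << 1) | 1)       # Bit-Shift(R_{d-1}(j))
                    | prev_old) & full            # R_{d-1}(j-1)
            prev_old = old
        if R[k] & accept:
            ends.append(j)
    return ends
```

Only k + 1 integers are kept at any time, which mirrors the point made above that no entire R_d table needs to be stored. Like the recurrence, the sketch reports end positions only.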
For a more complete specification of the algorithm, see the program code and its comments.

We tested the implementation of this algorithm for speed. We chose to test patterns of different lengths S and with different K values against the Human coronavirus 229E sequence, which has a size of about base pairs. The worst running time found in
the trials, using S as 22 and an exaggerated value for K of 9 allowed differences, was under 0.2 seconds. These results were promising because a program for all-against-all comparison could use the shift-and algorithm as a subroutine to quickly filter out candidate patterns which don't have matches in all the sequences. The pattern-against-text matching could also speed up the substring-against-substring comparisons done in the existing APBT multi-sequence implementation.

Table 2: Performance of pattern-against-text matching
Columns: pattern length (chars), allowed differences, number of solutions, processing time (ms)

3.3 Returning candidates and their matches in multiple-sequence comparison

The strategy for finding candidates and their matches is as follows:

- Use APBT to find the candidate matches between 2 preferred sequences (s_1, s_2);
- For each of these candidates, as taken from the first sequence s_1:
  o for each of the sequences s_2 to s_m, get the end positions of the matches to the candidate; store these in a map, indexed by the candidate;
  o if in any of the sequences the candidate has 0 matches, remove the candidate from consideration;
- Report each candidate along with its set of matches from each sequence, after post-processing to find the start positions as well as the end positions of the matches.

Pattern-text matching reports only the end positions of the matches. However, the exact coordinates of every match have to be reported in order for the users to visualize the results. After any set of end positions is returned by pattern matching, we can compute the starting positions of the matches and then store the matches as pairs of start and end
positions. Since the start points have to be between (end position - S - k) and (end position - S + k), we can just test every possible start point. There are then up to 2k + 1 possibilities for the starting point, and each could define a separate match, so the number of matches could grow by a factor on the order of k when doing this. We decided to only store the farthest start position for any given end position, so that the users know that the region between this start point and the end point contains at least one match for the given pattern.

Here is an example to illustrate what solutions are reported after this processing (note that the substrings are given in this example, not their positions): given k = 1, S = 4, and sequences t01 = CATC, t02 = TTTTCATCTT, t03 = CGTCGGCATC, the solutions reported are: CATC from t01; two substrings of TTTTCATCTT from t02; and two substrings of CGTCGGCATC from t03. Note that GCATC is reported from t03 and not CATC, because they have the same end point and GCATC is longer. Also note that if S = 3, then CAT would also be reported as a match to CATC, because even though it is contained completely in the reported GCATC, it has a different end point.

In testing the program, we compared running times with and without also finding the start points, to see whether that step reduces the benefits of using the fast pattern-text matching. We found that both the number of matches and the running time remain of the same order of magnitude as before finding the start points.

First, we used the same sequences as the ones used by Barsky [1] to test the 2-sequence APBT. These are five viral DNA sequences, D1 to D5, and three animal protein sequences, P1 to P3; for a description of the sequences, refer to [1]. The running time of the algorithm was split into the first step, using APBT on the first two input sequences to get candidate patterns, and the second step, processing the patterns to get their matches from all the sequences.

Table 3: Performance of the program
Columns: test sequences, S, K, APBT time (s), processing time after APBT (s), # APBT candidates, # results remaining
Test sequence sets: D1 D4 D2 D3 D; P2 P3 P; P2 P3 P

3.4 Left to complete

The example above indicates that overlapping solutions can be reported by the program. One useful addition that would make the results more usable by Upton's lab would be to combine any two matches to one candidate that come from the same sequence and overlap. Further, we have to store the matches in a fashion that is useful for the scientists using the program. According to C. Upton's team, they would like the output as one file per candidate pattern and its set of matches. This file should be in FASTA format and include for each of the matches: a description line with the identifier of the sequence it is from and the start and end positions; and a data line, with the actual sequence of this match.

4 Additional features desired in the program

Upton's lab has used both the APBT program and M. Barsky's multiple-sequence version. They noticed that when running the APBT program on very long sequences, such as entire viral genomes, with reasonable values of S, e.g. 20, the number of matches reported can be very large. This is unavoidable given that both the lengths and similarities of the input sequences can be varied by the user. What the virologists would like is to input several viral genomes, of length around bp, and have the program predict values of S and K such that the number of hits is less than a fixed limit. We decided that a sampling strategy could be used to try to pinpoint desired values of the parameters. This could be built on top of the multiple-sequence implementation of the program.
5 Conclusion

The greatest advantage of the current implementation is the speed with which it reports the patterns which satisfy the desired criteria. The implementation leaves open many possibilities for post-processing to enforce additional criteria on the output. It can also be extended to solve the multiple-sequence threshold all-against-all substring matching problem, by further processing of the sets of matches reported.
References

1. BARSKY M., STEGE U., THOMO A. and UPTON C. A graph approach to the threshold all-against-all substring matching problem, Submitted to The Journal of Experimental Algorithmics.
2. BARSKY M. Personal communication, University of Victoria.
3. UPTON C. Personal communication, University of Victoria.
4. BAEZA-YATES R. A. and GONNET G. H. All-against-all sequence matching, Report of the Department of Computer Science, Univ. de Chile.
5. BAEZA-YATES R. A. and GONNET G. H. A New Approach to Text Searching, Communications of the ACM, 35.
6. MANBER U. and WU S. Fast text searching with errors, Department of Computer Science, U. of Arizona.
More informationTreaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19
CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types
More informationSolving NP-hard Problems on Special Instances
Solving NP-hard Problems on Special Instances Solve it in poly- time I can t You can assume the input is xxxxx No Problem, here is a poly-time algorithm 1 Solving NP-hard Problems on Special Instances
More informationSearch means finding a path or traversal between a start node and one of a set of goal nodes. Search is a study of states and their transitions.
UNIT 3 BASIC TRAVERSAL AND SEARCH TECHNIQUES Search means finding a path or traversal between a start node and one of a set of goal nodes. Search is a study of states and their transitions. Search involves
More informationChapter 3 Trees. Theorem A graph T is a tree if, and only if, every two distinct vertices of T are joined by a unique path.
Chapter 3 Trees Section 3. Fundamental Properties of Trees Suppose your city is planning to construct a rapid rail system. They want to construct the most economical system possible that will meet the
More informationCS/COE 1501 cs.pitt.edu/~bill/1501/ Graphs
CS/COE 1501 cs.pitt.edu/~bill/1501/ Graphs 5 3 2 4 1 0 2 Graphs A graph G = (V, E) Where V is a set of vertices E is a set of edges connecting vertex pairs Example: V = {0, 1, 2, 3, 4, 5} E = {(0, 1),
More informationPrinciples of Bioinformatics. BIO540/STA569/CSI660 Fall 2010
Principles of Bioinformatics BIO540/STA569/CSI660 Fall 2010 Lecture 11 Multiple Sequence Alignment I Administrivia Administrivia The midterm examination will be Monday, October 18 th, in class. Closed
More informationString Matching Algorithms
String Matching Algorithms Georgy Gimel farb (with basic contributions from M. J. Dinneen, Wikipedia, and web materials by Ch. Charras and Thierry Lecroq, Russ Cox, David Eppstein, etc.) COMPSCI 369 Computational
More information15.4 Longest common subsequence
15.4 Longest common subsequence Biological applications often need to compare the DNA of two (or more) different organisms A strand of DNA consists of a string of molecules called bases, where the possible
More informationGraphs. The ultimate data structure. graphs 1
Graphs The ultimate data structure graphs 1 Definition of graph Non-linear data structure consisting of nodes & links between them (like trees in this sense) Unlike trees, graph nodes may be completely
More informationLecture 3, Review of Algorithms. What is Algorithm?
BINF 336, Introduction to Computational Biology Lecture 3, Review of Algorithms Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Algorithm? Definition A process