New Implementation for the Multi-sequence All-Against-All Substring Matching Problem


Oana Sandu
Supervised by Ulrike Stege
In collaboration with Chris Upton, Alex Thomo, and Marina Barsky
University of Victoria, Department of Computer Science
September 19, 2007

Abstract

The threshold all-against-all problem in approximate string matching has important applications in bioinformatics. For two input sequences, an algorithm has been developed which depends linearly on the sizes of each of the sequences. However, this algorithm doesn't extend efficiently to more than two sequences in a straightforward manner. After modifying the problem somewhat, we have developed a fast algorithm which reports patterns unusually conserved between any number of input sequences, and also returns all of their approximate matches.

Table of Contents

1 Introduction
2 Previous work and possible improvements
   2.1 Algorithm for the threshold all-against-all problem
       Description
       Possible improvements
   2.2 The multiple-sequence version of the all-against-all problem
       Description
       Possible improvements
3 A different implementation for multiple-sequence all-against-all matching
   3.1 The problem redefined
   3.2 Using pattern against text matching as a subroutine
   3.3 Returning candidates and their matches in multiple-sequence comparison
   3.4 Left to complete
4 Additional features desired in the program
5 Conclusion
References

1 Introduction

Marina Barsky, a Ph.D. candidate at the University of Victoria, developed an algorithm to solve the threshold all-against-all problem [1]. The problem is: given two strings of lengths M and N, find all maximal pairs of substrings of length at least S with at most K differences. She solves it using a method referred to as All Paths Below Threshold, or APBT. Her algorithm scales well with the sizes of the sequences and the number of differences.

This paper will present her implementation of solving the multiple-sequence version of the problem, also using APBT. Our main aim is to find a different approach to solving this problem, and to provide an implementation for it.

2 Previous work and possible improvements

2.1 Algorithm for the threshold all-against-all problem

Description

Marina Barsky, a Ph.D. candidate at the University of Victoria, developed an algorithm to solve the threshold all-against-all problem. The problem is: given two strings of lengths M and N, find all maximal pairs of substrings of length at least S with at most K differences. Previous approaches to solving this problem exactly were either based on dynamic programming or used suffix trees to avoid recomputing matches. The most efficient of these attempts is the one published by Baeza-Yates and Gonnet in [4]. Barsky's algorithm, developed in conjunction with the University of Victoria Computer Science professors Ulrike Stege and Alex Thomo, as well as Prof. Chris Upton (Biochemistry), aimed to improve on their worst-case running time of $O(M^2 N^2)$ [1]. Her algorithm runs in $O(MNK^3)$.

The problem solved by M. Barsky, which she calls "all error-bounded approximate matches", is: given two strings s and t, of lengths M and N, find all pairs of substrings (s[i,j], t[k,l]) such that the length of both substrings is greater than S and the edit distance between them is at most K. The substrings s[i,j] and t[k,l] are referred to as K-approximate matches. The matches reported are maximal, meaning that if (s[i,j], t[k,l]) is a solution, then there is no valid solution (s', t') with s' being a superstring of s[i,j] and t' = t[k,l], or vice versa.
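To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not APBT and is far too slow for real genomes; it only spells out the definition of a K-approximate match. The function names are hypothetical, positions are 0-based and inclusive, the length bound is taken as "at least S", and the maximality filter is deliberately left out.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def brute_force_matches(s: str, t: str, S: int, K: int):
    """All pairs ((i, j), (k, l)) with both substrings of length >= S
    and edit distance <= K. Illustrative only: no maximality filter."""
    out = []
    for i in range(len(s)):
        for j in range(i + S - 1, len(s)):
            for k in range(len(t)):
                for l in range(k + S - 1, len(t)):
                    # Lengths differing by more than K cannot be within K edits.
                    if abs((j - i) - (l - k)) > K:
                        continue
                    if edit_distance(s[i:j + 1], t[k:l + 1]) <= K:
                        out.append(((i, j), (k, l)))
    return out
```

For example, `brute_force_matches("CATC", "CGTC", 4, 1)` pairs the two full strings, which differ by a single substitution. The four nested loops, each followed by a quadratic distance computation, show why an exhaustive approach does not scale and why a dedicated algorithm is needed.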

Barsky considers a matching matrix m which has s as one dimension, t as the second dimension, and m[i,j] = 1 iff s[i] = t[j]. The matrix m is used to induce a digraph $G_m$, with each vertex $v_{ij}$ corresponding to a 1 entry in the matrix (m[i,j] = 1). $G_m$ contains all edges $(v_{ij}, v_{kl})$ such that i < k and j < l. The graph is searched for paths P from $v_{ij}$ to $v_{kl}$ that have a match length greater than S and an error number of at most K. The match length of P is related to the difference between the coordinates of $v_{ij}$ and $v_{kl}$, and is equal to min(k-i+1, l-j+1). The error number of P is the sum of the costs of all the edges it contains. The cost of an edge from $v_{ab}$ to $v_{cd}$ is max(c-a, d-b) - 1, corresponding to the minimum number of edit operations required to transform one of the substrings, say s[a,c], into the other, say t[b,d]. Here, an edit operation can be an insertion or deletion of one character, or a substitution between two characters. Hence, the error number of the path corresponds to the cost of an edit transcript between s[i,k] and t[j,l]. Barsky shows that the error number of the shortest path between any two vertices in $G_m$ corresponds to the actual edit distance between the two corresponding substrings. This means that the problem of finding matches between strings s and t can be reduced to finding all maximal paths of match length at least S, with error number of at most K.

Possible improvements

To solve the all-paths-below-threshold problem, Barsky doesn't build the entire matrix m or the graph $G_m$ at once, but computes values from them as needed. Also, information already computed about paths is used to determine whether to consider new paths. For more details on this, and for a proof that the algorithm runs in $O(MNK^3)$, see Section 4 of Barsky's paper [1].

Barsky suggested to me that the implementation could be sped up if it were run in parallel on several processors.
This is possible because sets of adjacent rows in the matrix can be processed independently. Hence, given $s_1$, $s_2$ and a chosen chunk size in columns, a central processor could have auxiliary processors run APBT on $s_2$ and a chunk of $s_1$, and then combine the results from different parts of the matrix into the complete solution for $s_1$ and $s_2$. According to Barsky, for highly similar sequences, APBT can take as much as 20 hours to return a result, and parallelizing the algorithm could reduce the running time by a significant factor [2].

2.2 The multiple-sequence version of the all-against-all problem

Description

Chris Upton's virology lab is interested in predicting unusually conserved short regions of DNA or protein, given large input sequences. However, because of the nature of their research, they are more interested in regions that are unusually similar between a number of organisms, not just two; therefore, it is very important to provide them with an efficient program that takes several sequences as input [3].

The multiple-sequence version of the all-against-all problem can be defined as: given strings $s_1, s_2, \ldots, s_m$, find all matches $(s_1', s_2', \ldots, s_m')$, with each $s_i'$ a substring of $s_i$, such that the substrings are all of length S and the edit distance between any two is always at most K.

M. Barsky tried to use the existing implementation of APBT to get a method that would solve the multiple-sequence version of the problem exactly. It involves a filtering step in which the all-paths-below-threshold (APBT) algorithm is essentially run on all pairs of the given strings to get a set of potential start positions of matches in each of the strings. For each of the start positions from a certain chosen sequence, Barsky builds all the matches which include this start position as branches in a tree rooted at the original start position [2]. The algorithm breaks down to:

• run APBT on $s_1$, $s_2$
   o mark all possible match start positions in $s_1$ and $s_2$
• starting only at the marked start positions
   o run APBT on $s_1$, $s_3$ and eliminate the candidate start positions from $s_1$ with no matches
   o run APBT on $s_2$, $s_3$ and eliminate the candidate start positions from $s_2$ with no matches
   o continue doing this until all pairs of sequences have been compared;
   o then repeat the three previous steps on all pairs of sequences, eliminating more candidate start positions, until a run of APBT yields a less than 10-fold reduction in the number of candidates
• recover the candidate patterns
   o perform APBT only on the marked start positions in $s_1$, $s_2$
   o for each match reported by APBT from $s_1$, run APBT on it and $s_2$, then add the resulting matches as children of the match in a tree
      - run APBT on the match and $s_3$, and for each match, add it as a child of one of the matches added from $s_2$, iff their edit distance is less than K
      - repeat this for each sequence up to $s_m$
   o for every tree, report all the distinct paths from root to leaf as solution tuples $(s_1', s_2', \ldots, s_m')$

Possible improvements

Marina encountered a problem when running her algorithm on 4 or more biologically related sequences. Namely, the memory required to build all of the trees is such that the intermediate information, even for one tree, can be larger than the amount of memory available in the average lab. The program is heavily slowed down by writing intermediate information to disk, then retrieving it [2]. Another issue is that the filter is very time-consuming, since it applies APBT on the order of $n^2$ times for n sequences. APBT itself can also be quite time-consuming for long, relatively similar sequences.

My project focuses mostly on finding a different approach to solving the multiple-sequence version of the threshold all-against-all problem. We aim to reduce the memory requirements, since they slow down the program. Another goal is to reduce the number of times APBT is run, since it can be very slow for similar sequences. The program can also likely be sped up by using faster ways of comparing candidates against sequences, as opposed to using APBT as a subroutine.
Much of the information stored in the trees described above is redundant: if for instance $s_{31}$ is present in both subtrees rooted at $s_{21}$ and $s_{22}$, and $s_{31}$ differs from both $s_{21}$ and $s_{22}$ at the same position, then the entire subtree rooted at $s_{31}$ could appear twice. Also, many of the candidates added to the tree at the sequence 3 level could be eliminated at a later point in the algorithm (this happens if, e.g., $s_{31}$ has no match in sequence m). Hence, to save memory we could store potential matches and their edges in a graph until all the edges are computed, and then the graph can be used in recovering the actual solution tuples. To build this new graph G, proceed as in the pseudocode for the all-against-all problem, only adding a substring $r_{ij}$ at the i-th level in the graph if and only if it has edges to every single level from 1 to i-1. This is essentially building the trees above as a graph, but without storing the paths separately. Then, the graph can be processed to obtain all the valid solution tuples as induced complete subgraphs in G, i.e., sets of m vertices, one from each level of the graph, such that every vertex in the set has edges to every single other vertex in the set.

3 A different implementation for multiple-sequence all-against-all matching

3.1 The problem redefined

C. Upton told us that he would like results from the multiple-sequence version in useful time, even if the results reported are a superset of the actual solution to the multiple all-against-all problem. What he specifically wants is a set of patterns, say taken from $s_1$, which are unusually conserved in all of the sequences. He doesn't necessarily need any two substrings in the set of matches to each pattern to be within K differences from each other. Hence, as long as the number of patterns is low enough, he can use the positions of their matches in all of the other sequences to figure out which are significant. He especially insisted that the current implementation is too slow to be usable by his lab, so he would like a fast algorithm, as long as no potential solutions are left out [3].

The idea behind the implementation done in the current project is to use APBT to get a set of candidate patterns. Then, we aim to quickly determine the candidates which have matches in all of the sequences and what these matches are.
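The second stage of this idea can be illustrated with a minimal Python sketch. It assumes the candidate patterns are already available (for example from an APBT run, which is not reproduced here) and keeps only the candidates that have at least one approximate match in every sequence. The matcher is the classic column-by-column dynamic programming for approximate pattern matching, often attributed to Sellers, used here purely as a simple stand-in; the names `match_ends` and `filter_candidates` are hypothetical.

```python
def match_ends(pattern: str, text: str, k: int):
    """0-based end positions j such that some substring of text ending
    at j is within edit distance k of pattern (dynamic-programming stand-in)."""
    m = len(pattern)
    # col[i] = best edit distance between pattern[:i] and a suffix of the
    # text read so far; col[0] is always 0 because a match may start anywhere.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        new = [0]
        for i in range(1, m + 1):
            new.append(min(col[i] + 1,                          # extra text character
                           new[i - 1] + 1,                      # missing pattern character
                           col[i - 1] + (pattern[i - 1] != c))) # match or substitution
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends


def filter_candidates(candidates, sequences, k):
    """Map each surviving candidate to its list of end positions per sequence.
    A candidate is dropped as soon as one sequence contains no match."""
    kept = {}
    for cand in candidates:
        per_sequence = []
        for seq in sequences:
            ends = match_ends(cand, seq, k)
            if not ends:
                break
            per_sequence.append(ends)
        else:
            kept[cand] = per_sequence
    return kept
```

The early `break` reflects the filtering intent: a candidate with no match in some sequence cannot be conserved in all of them, so the remaining sequences need not be scanned for it.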
We examined how APBT performs with inputs that would be common to Upton's lab, so that we know the number of candidate patterns that will have to be matched against the other sequences. The test sequences were Mycobacteriophage D29 and Mycobacteriophage Bxb1, taken from Barsky's paper [1], related viral genomes of around [...] base pairs; several values of S and K were attempted, with K around 10% of S, as suggested by Upton.

Table 1: Tests of APBT on two viral sequences

S    K    Number of solutions    Processing time (min.)

The strategy for solving the multiple-sequence version is to use APBT on two sequences to get candidate patterns and then compare the candidates to the rest of the sequences using fast pattern-text matching. This can be used to efficiently report a set of matches for each candidate, with at least one match from each sequence. Since all the matches in any one set have an edit distance of at most K to the original pattern in the set, they are guaranteed to have at most 2K differences between each other. After this implementation, if the users require it, we can further process each set to produce sets of tuples, with one substring from each sequence. This will guarantee that any two substrings have an edit distance of at most K.

3.2 Using pattern against text matching as a subroutine

We based the pattern-text matching on the algorithm first described by Baeza-Yates and Gonnet in A New Approach to Text Searching [5]. This algorithm takes advantage of the small size of the pattern to use computer words, bit shifts and logical bitwise operations to determine where matches to the pattern end in the text. For the sizes we're interested in, the pattern can be encoded as several bit masks: there would be one bit mask for each letter in the pattern, and a 1 bit would signify that the letter appears at that position. Note that we used Wu and Manber's extension of this algorithm [6] to also match the pattern to the text with up to k single-letter insertions, deletions, or substitutions.

The algorithm is given a pattern P, a text T, and values for parameters k and S, and then determines all the end positions in T of matches to P with up to k errors. First, we compute a set of bit masks U, one for each character of the alphabet. Then, we compute a set of matrices $R_0, R_1, \ldots, R_k$, where $R_0$ is the exact-matching case. Every column in $R_d$ represents a different position in the text; a bit $R_d(i, j)$ is 1 iff there is a match to the prefix P[1..i] ending at T(j), with d allowed differences. Column j of $R_d$ depends on: column j-1 of $R_d$, the bit mask for character T(j), and some columns of $R_{d-1}$. The recurrence relation to compute column j of $R_d$ is:

\[
R_d(j) =
\begin{cases}
\underbrace{1 \cdots 1}_{d \text{ ones}} & \text{if } j = 0,\\[4pt]
\big[\operatorname{BitShift}(R_d(j-1)) \ \mathrm{AND}\ U(T(j))\big]
\ \mathrm{OR}\ \operatorname{BitShift}(R_{d-1}(j-1))
\ \mathrm{OR}\ \operatorname{BitShift}(R_{d-1}(j))
\ \mathrm{OR}\ R_{d-1}(j-1) & \text{if } j \ge 1.
\end{cases}
\]

Note that the final bit of column $R_k(j)$ indicates whether or not there is a match to P ending at T(j) with up to k differences.

Since computing a column $R_d(j)$ involves looking up $R_d(j-1)$, $R_{d-1}(j-1)$ and $R_{d-1}(j)$, and j could be of the order of tens of thousands, we don't store any entire $R_d$ table. We consider instead the $R_d(j)$ values to be cells in a table, with d as one dimension and j as the other. We compute values in the table diagonal by diagonal, storing a position j as a solution when appropriate. In this way, we only need to store the two previous diagonals when computing a new one, which are each at most d values. We also represent the actual values of $R_d(j)$ as 64-bit longs; this limits the length of a pattern to at most 64 characters, long enough for C. Upton's purposes. The maximum length can easily be changed to 128 characters in the future.
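The recurrence can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the project's program: it walks the text column by column and keeps only k+1 machine words, rather than the diagonal-by-diagonal order described above; Bit-Shift is assumed to shift in a 1 at the lowest position, as in Wu and Manber's formulation; Python integers are unbounded, so the 64-character limit does not apply here; and the function name `wu_manber_ends` is hypothetical.

```python
def wu_manber_ends(pattern: str, text: str, k: int):
    """0-based end positions in text of matches to pattern with at most
    k insertions, deletions, or substitutions (bit-parallel shift-and)."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1          # keep only the m bits that correspond to P
    accept = 1 << (m - 1)        # final bit: the whole pattern is matched

    # U[c] has bit i set iff pattern[i] == c.
    U = {}
    for i, ch in enumerate(pattern):
        U[ch] = U.get(ch, 0) | (1 << i)

    # R[d] starts as d ones: a prefix of length <= d matches the empty
    # text with at most d errors.
    R = [((1 << d) - 1) & full for d in range(k + 1)]

    ends = []
    for j, ch in enumerate(text):
        mask = U.get(ch, 0)
        prev_old = R[0]                              # R_{d-1}(j-1)
        R[0] = ((R[0] << 1) | 1) & mask              # exact matching
        for d in range(1, k + 1):
            old = R[d]                               # R_d(j-1)
            R[d] = ((((old << 1) | 1) & mask)        # characters match
                    | ((prev_old << 1) | 1)          # substitution
                    | ((R[d - 1] << 1) | 1)          # pattern character skipped
                    | prev_old) & full               # extra text character
            prev_old = old
        if R[k] & accept:
            ends.append(j)
    return ends
```

For instance, `wu_manber_ends("CATC", "TTTTCATCTT", 0)` returns `[7]`, the position where the exact occurrence ends. Each text character costs k+1 constant-time word operations, which is what makes this approach attractive as a fast filter for short patterns.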
For a more complete specification of the algorithm, see the program code and its comments.

We tested the implementation of this algorithm for speed. We chose to test patterns of different lengths S and with different K values against the Human coronavirus 229E sequence, which has a size of about [...] base pairs. The worst running time found in the trials, using S = 22 and an exaggerated value for K of 9 allowed differences, was under 0.2 seconds. These results were promising because a program for all-against-all comparison could use the shift-and matching as a subroutine to quickly filter out candidate patterns which don't have matches in all the sequences. The pattern-against-text matching could also speed up the substring-against-substring comparisons done in the existing APBT multi-sequence implementation.

Table 2: Performance of pattern-against-text matching

pattern length (chars)    allowed differences    number of solutions    processing time (ms)

3.3 Returning candidates and their matches in multiple-sequence comparison

The strategy for finding candidates and their matches is as follows:

• Use APBT to find the candidate matches between 2 preferred sequences ($s_1$, $s_2$);
• For each of these candidates as taken from the first sequence $s_1$:
   o for each of the sequences $s_2$ to $s_m$, get the end positions of the matches to the candidate; store these in a map, indexed by the candidate;
   o if in any of the sequences the candidate has 0 matches, remove the candidate from consideration;
• Report each candidate along with its set of matches from each sequence, after post-processing to find the start positions as well as the end positions of the matches.

Pattern-text matching reports only the end positions of the matches. However, the exact coordinates of every match have to be reported in order for the users to visualize the results. After any set of end positions is returned by pattern matching, we can compute the starting positions of the matches and then store the matches as pairs of start and end positions. Since the start points have to be between (end position - S - k) and (end position - S + k), we can just test every possible start point. There are then on the order of k possibilities (up to 2k + 1) for the starting point, and each could define a separate match, so the number of matches could grow by a factor on the order of k when doing this. We decided to only store the farthest start position for any given end position, so that the users know that the region between this start point and the end point contains at least one match for the given pattern.

Here is an example to illustrate what solutions are reported after this processing (note that the substrings are given in this example, not their positions): given k = 1, S = 4, and sequences t01 = CATC, t02 = TTTTCATCTT, t03 = CGTCGGCATC, the solutions reported are: CATC from t01; two regions of TTTTCATCTT from t02; and two regions of CGTCGGCATC from t03, one of which is GCATC. Note that GCATC is reported from t03 and not CATC, because they have the same end point and GCATC is longer. Also note that if S = 3, then CAT would also be reported as a match to CATC, because even if it is contained completely in the reported GCATC, it has a different end point.

In testing the program, we compared running times with and without determining the start points, to see whether that step reduces the benefits of using the fast pattern-text matching. We found that both the number of matches and the running time remain of the same order of magnitude as before finding the start points.

First, we used the same sequences as the ones used by Barsky [1] to test the 2-sequence APBT. These are five viral DNA sequences, D1 to D5, and three animal protein sequences, P1 to P3; for a description of the sequences, refer to [1]. The running time of the algorithm was split into the first step, using APBT on the first two sequences input to get candidate patterns, and the second step, processing the patterns to get their matches from all the sequences.

Table 3: Performance of the program

Test sequences    S    K    APBT time (s)    Processing time after APBT (s)    # APBT candidates    # results remaining
D1 D4 D2 D3 D
P2 P3 P
P2 P3 P

3.4 Left to complete

The example above indicates that overlapping solutions can be reported by the program. One useful addition that would make the results more usable by Upton's lab would be to merge any two matches to the same candidate that come from the same sequence and overlap each other. Further, we have to store the matches in a fashion that is useful for the scientists using the program. According to C. Upton's team, they would like the output as one file per candidate pattern and its set of matches. This file should be in FASTA format and include for each of the matches: a description line with the identifier of the sequence it is from and the start and end positions; and a data line with the actual sequence of this match.

4 Additional features desired in the program

Upton's lab has used both the APBT program and M. Barsky's multiple-sequence version. They noticed that when running the APBT program on very long sequences, such as entire viral genomes, with reasonable values of S, e.g. 20, the number of matches reported can be very large. This is unavoidable given that both the lengths and similarities of the input sequences can be varied by the user. What the virologists would like is to input several viral genomes, of length around [...] bp, and have the program predict values of S and K such that the number of hits is less than [...]. We decided that a sampling strategy could be used to try to pinpoint desired values of the parameters. This could be built on top of the multiple-sequence implementation of the program.
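The two output-related items listed under "Left to complete" (merging overlapping matches and writing FASTA records) could be sketched as follows in Python. This is only an illustration under assumptions: matches are taken to be 0-based inclusive (start, end) pairs, the function names are hypothetical, and the exact layout of the description line is a guess, since the text only says it should carry the sequence identifier and the start and end positions.

```python
def merge_overlapping(matches):
    """Merge (start, end) pairs from one sequence that overlap each other.
    Positions are assumed to be 0-based and inclusive."""
    merged = []
    for start, end in sorted(matches):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def fasta_records(seq_id: str, sequence: str, matches) -> str:
    """One description line and one data line per match.
    The 'start=... end=...' header layout is a hypothetical choice."""
    lines = []
    for start, end in matches:
        lines.append(f">{seq_id} start={start} end={end}")
        lines.append(sequence[start:end + 1])
    return "\n".join(lines) + "\n"
```

A typical use would merge first and then format, e.g. `fasta_records("t02", "TTTTCATCTT", merge_overlapping([(4, 7), (3, 7)]))`, so that each conserved region appears once in the file for its candidate pattern.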

5 Conclusion

The greatest advantage of the current implementation is the speed with which it reports the patterns which satisfy the desired criteria. The implementation leaves open many possibilities for post-processing to enforce additional criteria on the output. It can also be extended to solve the multiple-sequence threshold all-against-all substring matching problem, by further processing of the sets of matches reported.

References

1. BARSKY M., STEGE U., THOMO A. and UPTON C. A graph approach to the threshold all-against-all substring matching problem, Submitted to The Journal of Experimental Algorithmics.
2. BARSKY M. Personal communication, University of Victoria.
3. UPTON C. Personal communication, University of Victoria.
4. BAEZA-YATES R. A. and GONNET G. H. All-against-all sequence matching, Report of the Department of Computer Science, Univ. de Chile.
5. BAEZA-YATES R. A. and GONNET G. H. A New Approach to Text Searching, Communications of the ACM, 35.
6. MANBER U. and WU S. Fast text searching with errors, Department of Computer Science, U. of Arizona.


More information

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive

More information

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42 String Matching Pedro Ribeiro DCC/FCUP 2016/2017 Pedro Ribeiro (DCC/FCUP) String Matching 2016/2017 1 / 42 On this lecture The String Matching Problem Naive Algorithm Deterministic Finite Automata Knuth-Morris-Pratt

More information

CS161 - Final Exam Computer Science Department, Stanford University August 16, 2008

CS161 - Final Exam Computer Science Department, Stanford University August 16, 2008 CS161 - Final Exam Computer Science Department, Stanford University August 16, 2008 Name: Honor Code 1. The Honor Code is an undertaking of the students, individually and collectively: a) that they will

More information

Dynamic Programming (cont d) CS 466 Saurabh Sinha

Dynamic Programming (cont d) CS 466 Saurabh Sinha Dynamic Programming (cont d) CS 466 Saurabh Sinha Spliced Alignment Begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the

More information

COMP3121/3821/9101/ s1 Assignment 1

COMP3121/3821/9101/ s1 Assignment 1 Sample solutions to assignment 1 1. (a) Describe an O(n log n) algorithm (in the sense of the worst case performance) that, given an array S of n integers and another integer x, determines whether or not

More information

March 20/2003 Jayakanth Srinivasan,

March 20/2003 Jayakanth Srinivasan, Definition : A simple graph G = (V, E) consists of V, a nonempty set of vertices, and E, a set of unordered pairs of distinct elements of V called edges. Definition : In a multigraph G = (V, E) two or

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 6: Alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg

More information

Backtracking and Branch-and-Bound

Backtracking and Branch-and-Bound Backtracking and Branch-and-Bound Usually for problems with high complexity Exhaustive Search is too time consuming Cut down on some search using special methods Idea: Construct partial solutions and extend

More information

Lecture 5: Suffix Trees

Lecture 5: Suffix Trees Longest Common Substring Problem Lecture 5: Suffix Trees Given a text T = GGAGCTTAGAACT and a string P = ATTCGCTTAGCCTA, how do we find the longest common substring between them? Here the longest common

More information

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19 CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types

More information

Solving NP-hard Problems on Special Instances

Solving NP-hard Problems on Special Instances Solving NP-hard Problems on Special Instances Solve it in poly- time I can t You can assume the input is xxxxx No Problem, here is a poly-time algorithm 1 Solving NP-hard Problems on Special Instances

More information

Search means finding a path or traversal between a start node and one of a set of goal nodes. Search is a study of states and their transitions.

Search means finding a path or traversal between a start node and one of a set of goal nodes. Search is a study of states and their transitions. UNIT 3 BASIC TRAVERSAL AND SEARCH TECHNIQUES Search means finding a path or traversal between a start node and one of a set of goal nodes. Search is a study of states and their transitions. Search involves

More information

Chapter 3 Trees. Theorem A graph T is a tree if, and only if, every two distinct vertices of T are joined by a unique path.

Chapter 3 Trees. Theorem A graph T is a tree if, and only if, every two distinct vertices of T are joined by a unique path. Chapter 3 Trees Section 3. Fundamental Properties of Trees Suppose your city is planning to construct a rapid rail system. They want to construct the most economical system possible that will meet the

More information

CS/COE 1501 cs.pitt.edu/~bill/1501/ Graphs

CS/COE 1501 cs.pitt.edu/~bill/1501/ Graphs CS/COE 1501 cs.pitt.edu/~bill/1501/ Graphs 5 3 2 4 1 0 2 Graphs A graph G = (V, E) Where V is a set of vertices E is a set of edges connecting vertex pairs Example: V = {0, 1, 2, 3, 4, 5} E = {(0, 1),

More information

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010 Principles of Bioinformatics BIO540/STA569/CSI660 Fall 2010 Lecture 11 Multiple Sequence Alignment I Administrivia Administrivia The midterm examination will be Monday, October 18 th, in class. Closed

More information

String Matching Algorithms

String Matching Algorithms String Matching Algorithms Georgy Gimel farb (with basic contributions from M. J. Dinneen, Wikipedia, and web materials by Ch. Charras and Thierry Lecroq, Russ Cox, David Eppstein, etc.) COMPSCI 369 Computational

More information

15.4 Longest common subsequence

15.4 Longest common subsequence 15.4 Longest common subsequence Biological applications often need to compare the DNA of two (or more) different organisms A strand of DNA consists of a string of molecules called bases, where the possible

More information

Graphs. The ultimate data structure. graphs 1

Graphs. The ultimate data structure. graphs 1 Graphs The ultimate data structure graphs 1 Definition of graph Non-linear data structure consisting of nodes & links between them (like trees in this sense) Unlike trees, graph nodes may be completely

More information

Lecture 3, Review of Algorithms. What is Algorithm?

Lecture 3, Review of Algorithms. What is Algorithm? BINF 336, Introduction to Computational Biology Lecture 3, Review of Algorithms Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Algorithm? Definition A process

More information

BLAST & Genome assembly

BLAST & Genome assembly BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3

More information

Computational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh

Computational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh Computational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh Overlap detection: Semi-Global Alignment An overlap of two sequences is considered an

More information

ACM-ICPC Indonesia National Contest Problem A. The Best Team. Time Limit: 2s

ACM-ICPC Indonesia National Contest Problem A. The Best Team. Time Limit: 2s Problem A The Best Team Time Limit: 2s ACM-ICPC 2010 is drawing near and your university want to select three out of N students to form the best team. The university however, has a limited budget, so they

More information

Backtracking. Chapter 5

Backtracking. Chapter 5 1 Backtracking Chapter 5 2 Objectives Describe the backtrack programming technique Determine when the backtracking technique is an appropriate approach to solving a problem Define a state space tree for

More information

CS 6783 (Applied Algorithms) Lecture 5

CS 6783 (Applied Algorithms) Lecture 5 CS 6783 (Applied Algorithms) Lecture 5 Antonina Kolokolova January 19, 2012 1 Minimum Spanning Trees An undirected graph G is a pair (V, E); V is a set (of vertices or nodes); E is a set of (undirected)

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Steven Skiena. skiena

Steven Skiena.   skiena Lecture 12: Examples of Dynamic Programming (1997) Steven Skiena Department of Computer Science State University of New York Stony Brook, NY 11794 4400 http://www.cs.sunysb.edu/ skiena Give an O(n 2 )

More information

Module 5 Graph Algorithms

Module 5 Graph Algorithms Module 5 Graph lgorithms Dr. Natarajan Meghanathan Professor of Computer Science Jackson State University Jackson, MS 97 E-mail: natarajan.meghanathan@jsums.edu 5. Graph Traversal lgorithms Depth First

More information

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah

More information

From Smith-Waterman to BLAST

From Smith-Waterman to BLAST From Smith-Waterman to BLAST Jeremy Buhler July 23, 2015 Smith-Waterman is the fundamental tool that we use to decide how similar two sequences are. Isn t that all that BLAST does? In principle, it is

More information

tree follows. Game Trees

tree follows. Game Trees CPSC-320: Intermediate Algorithm Design and Analysis 113 On a graph that is simply a linear list, or a graph consisting of a root node v that is connected to all other nodes, but such that no other edges

More information

EE 701 ROBOT VISION. Segmentation

EE 701 ROBOT VISION. Segmentation EE 701 ROBOT VISION Regions and Image Segmentation Histogram-based Segmentation Automatic Thresholding K-means Clustering Spatial Coherence Merging and Splitting Graph Theoretic Segmentation Region Growing

More information

CME 323: Distributed Algorithms and Optimization Instructor: Reza Zadeh HW#3 - Due at the beginning of class May 18th.

CME 323: Distributed Algorithms and Optimization Instructor: Reza Zadeh HW#3 - Due at the beginning of class May 18th. CME 323: Distributed Algorithms and Optimization Instructor: Reza Zadeh (rezab@stanford.edu) HW#3 - Due at the beginning of class May 18th. 1. Download the following materials: Slides: http://stanford.edu/~rezab/dao/slides/itas_workshop.pdf

More information

Elements of Graph Theory

Elements of Graph Theory Elements of Graph Theory Quick review of Chapters 9.1 9.5, 9.7 (studied in Mt1348/2008) = all basic concepts must be known New topics we will mostly skip shortest paths (Chapter 9.6), as that was covered

More information

Efficient Method for Half-Pixel Block Motion Estimation Using Block Differentials

Efficient Method for Half-Pixel Block Motion Estimation Using Block Differentials Efficient Method for Half-Pixel Block Motion Estimation Using Block Differentials Tuukka Toivonen and Janne Heikkilä Machine Vision Group Infotech Oulu and Department of Electrical and Information Engineering

More information

Chapter 9. Greedy Technique. Copyright 2007 Pearson Addison-Wesley. All rights reserved.

Chapter 9. Greedy Technique. Copyright 2007 Pearson Addison-Wesley. All rights reserved. Chapter 9 Greedy Technique Copyright 2007 Pearson Addison-Wesley. All rights reserved. Greedy Technique Constructs a solution to an optimization problem piece by piece through a sequence of choices that

More information

Module 6 NP-Complete Problems and Heuristics

Module 6 NP-Complete Problems and Heuristics Module 6 NP-Complete Problems and Heuristics Dr. Natarajan Meghanathan Professor of Computer Science Jackson State University Jackson, MS 39217 E-mail: natarajan.meghanathan@jsums.edu P, NP-Problems Class

More information

CS473-Algorithms I. Lecture 11. Greedy Algorithms. Cevdet Aykanat - Bilkent University Computer Engineering Department

CS473-Algorithms I. Lecture 11. Greedy Algorithms. Cevdet Aykanat - Bilkent University Computer Engineering Department CS473-Algorithms I Lecture 11 Greedy Algorithms 1 Activity Selection Problem Input: a set S {1, 2,, n} of n activities s i =Start time of activity i, f i = Finish time of activity i Activity i takes place

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

5 Matchings in Bipartite Graphs and Their Applications

5 Matchings in Bipartite Graphs and Their Applications 5 Matchings in Bipartite Graphs and Their Applications 5.1 Matchings Definition 5.1 A matching M in a graph G is a set of edges of G, none of which is a loop, such that no two edges in M have a common

More information

Accelerating Protein Classification Using Suffix Trees

Accelerating Protein Classification Using Suffix Trees From: ISMB-00 Proceedings. Copyright 2000, AAAI (www.aaai.org). All rights reserved. Accelerating Protein Classification Using Suffix Trees Bogdan Dorohonceanu and C.G. Nevill-Manning Computer Science

More information

Branch-and-bound: an example

Branch-and-bound: an example Branch-and-bound: an example Giovanni Righini Università degli Studi di Milano Operations Research Complements The Linear Ordering Problem The Linear Ordering Problem (LOP) is an N P-hard combinatorial

More information

Suffix Vector: A Space-Efficient Suffix Tree Representation

Suffix Vector: A Space-Efficient Suffix Tree Representation Lecture Notes in Computer Science 1 Suffix Vector: A Space-Efficient Suffix Tree Representation Krisztián Monostori 1, Arkady Zaslavsky 1, and István Vajk 2 1 School of Computer Science and Software Engineering,

More information

Binary Decision Diagrams

Binary Decision Diagrams Logic and roof Hilary 2016 James Worrell Binary Decision Diagrams A propositional formula is determined up to logical equivalence by its truth table. If the formula has n variables then its truth table

More information

Report on the paper Summarization-based Mining Bipartite Graphs

Report on the paper Summarization-based Mining Bipartite Graphs Report on the paper Summarization-based Mining Bipartite Graphs Annika Glauser ETH Zuerich Spring 2015 Extract from the paper [1]: Introduction The paper Summarization-based Mining Bipartite Graphs introduces

More information

BackTracking Introduction

BackTracking Introduction Backtracking BackTracking Introduction Backtracking is used to solve problems in which a sequence of objects is chosen from a specified set so that the sequence satisfies some criterion. The classic example

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 4: Suffix trees Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg

More information

Recoloring k-degenerate graphs

Recoloring k-degenerate graphs Recoloring k-degenerate graphs Jozef Jirásek jirasekjozef@gmailcom Pavel Klavík pavel@klavikcz May 2, 2008 bstract This article presents several methods of transforming a given correct coloring of a k-degenerate

More information

S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani 165

S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani 165 S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani 165 5.22. You are given a graph G = (V, E) with positive edge weights, and a minimum spanning tree T = (V, E ) with respect to these weights; you may

More information

Basic Combinatorics. Math 40210, Section 01 Fall Homework 4 Solutions

Basic Combinatorics. Math 40210, Section 01 Fall Homework 4 Solutions Basic Combinatorics Math 40210, Section 01 Fall 2012 Homework 4 Solutions 1.4.2 2: One possible implementation: Start with abcgfjiea From edge cd build, using previously unmarked edges: cdhlponminjkghc

More information

Graphs and Network Flows IE411. Lecture 21. Dr. Ted Ralphs

Graphs and Network Flows IE411. Lecture 21. Dr. Ted Ralphs Graphs and Network Flows IE411 Lecture 21 Dr. Ted Ralphs IE411 Lecture 21 1 Combinatorial Optimization and Network Flows In general, most combinatorial optimization and integer programming problems are

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 5: Suffix trees and their applications Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory

More information

A Connection between Network Coding and. Convolutional Codes

A Connection between Network Coding and. Convolutional Codes A Connection between Network Coding and 1 Convolutional Codes Christina Fragouli, Emina Soljanin christina.fragouli@epfl.ch, emina@lucent.com Abstract The min-cut, max-flow theorem states that a source

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

We augment RBTs to support operations on dynamic sets of intervals A closed interval is an ordered pair of real

We augment RBTs to support operations on dynamic sets of intervals A closed interval is an ordered pair of real 14.3 Interval trees We augment RBTs to support operations on dynamic sets of intervals A closed interval is an ordered pair of real numbers ], with Interval ]represents the set Open and half-open intervals

More information

Clustering Using Graph Connectivity

Clustering Using Graph Connectivity Clustering Using Graph Connectivity Patrick Williams June 3, 010 1 Introduction It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the

More information

Lecture 25 Spanning Trees

Lecture 25 Spanning Trees Lecture 25 Spanning Trees 15-122: Principles of Imperative Computation (Fall 2018) Frank Pfenning, Iliano Cervesato The following is a simple example of a connected, undirected graph with 5 vertices (A,

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Ma/CS 6b Class 13: Counting Spanning Trees

Ma/CS 6b Class 13: Counting Spanning Trees Ma/CS 6b Class 13: Counting Spanning Trees By Adam Sheffer Reminder: Spanning Trees A spanning tree is a tree that contains all of the vertices of the graph. A graph can contain many distinct spanning

More information

Greedy Algorithms CHAPTER 16

Greedy Algorithms CHAPTER 16 CHAPTER 16 Greedy Algorithms In dynamic programming, the optimal solution is described in a recursive manner, and then is computed ``bottom up''. Dynamic programming is a powerful technique, but it often

More information

6. Finding Efficient Compressions; Huffman and Hu-Tucker

6. Finding Efficient Compressions; Huffman and Hu-Tucker 6. Finding Efficient Compressions; Huffman and Hu-Tucker We now address the question: how do we find a code that uses the frequency information about k length patterns efficiently to shorten our message?

More information

Reachability in K 3,3 -free and K 5 -free Graphs is in Unambiguous Logspace

Reachability in K 3,3 -free and K 5 -free Graphs is in Unambiguous Logspace CHICAGO JOURNAL OF THEORETICAL COMPUTER SCIENCE 2014, Article 2, pages 1 29 http://cjtcs.cs.uchicago.edu/ Reachability in K 3,3 -free and K 5 -free Graphs is in Unambiguous Logspace Thomas Thierauf Fabian

More information

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein

More information

String Patterns and Algorithms on Strings

String Patterns and Algorithms on Strings String Patterns and Algorithms on Strings Lecture delivered by: Venkatanatha Sarma Y Assistant Professor MSRSAS-Bangalore 11 Objectives To introduce the pattern matching problem and the important of algorithms

More information

Dynamic Programming II

Dynamic Programming II June 9, 214 DP: Longest common subsequence biologists often need to find out how similar are 2 DNA sequences DNA sequences are strings of bases: A, C, T and G how to define similarity? DP: Longest common

More information

Boolean Representations and Combinatorial Equivalence

Boolean Representations and Combinatorial Equivalence Chapter 2 Boolean Representations and Combinatorial Equivalence This chapter introduces different representations of Boolean functions. It then discusses the applications of these representations for proving

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Adapted from slides by Leena Salmena and Veli Mäkinen, which are partly from http: //bix.ucsd.edu/bioalgorithms/slides.php. 582670 Algorithms for Bioinformatics Lecture 6: Distance based clustering and

More information

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely

More information