Lecture 3: February 7. Local Alignment: The Smith-Waterman Algorithm

CSCI1820: Sequence Alignment, Spring 2017
Lecture 3: February 7
Lecturer: Sorin Istrail
Scribe: Pranavan Chanthrakumar

Note: LaTeX template courtesy of UC Berkeley EECS dept. Notes are also adapted from notes from previous years' offerings of CSCI1810 and CSCI1820.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

3.1 Local Alignment: The Smith-Waterman Algorithm

The Smith-Waterman algorithm is one of the most important and useful algorithms in computational biology. In the last lecture, we looked at both global and local alignment algorithms; here we will dive deeper into the intuition behind the local alignment algorithm, more formally known as the Smith-Waterman algorithm.

Prefixes and Suffixes

Before formalizing the above intuition and defining local alignment, we need a few definitions. A prefix $a$ of a string $b$ is a string that may be obtained by removing zero or more characters from the end of $b$. Similarly, a suffix $c$ of a string $b$ may be obtained by removing zero or more characters from the beginning of $b$. Then, a substring or subsequence [1] $a$ of a string $b$ may be obtained by removing characters from either end of $b$ (but not from the middle). See Figure 3.1 for examples of prefixes, suffixes, and substrings of the sequence ACGAAT.

Note that in Figure 3.1, we use $\epsilon$ to represent the empty string, which is also technically a prefix and a suffix of every string. Furthermore, a string is also a prefix and a suffix of itself. Excluding these two edge cases gives us the proper prefixes and proper suffixes of a string.

At this time, you should convince yourself that the set of all suffixes of prefixes of a string $x$ is equal to the set of all prefixes of suffixes of $x$, and that both are equal to the set of all substrings of $x$.

Local Alignment Definition

In Lecture 2, we had defined local alignment in a general sense.
Here, we will describe it more formally. The optimal local alignment of two strings $X$ and $Y$ is simply the highest-scoring optimal global alignment of any pair $X'$, $Y'$ such that $X'$ is a substring of $X$ and $Y'$ is a substring of $Y$.

This definition is deceptively simple. Note that it encapsulates our intuition, allowing us to identify regions of two strings that align well, even when the remainder of the strings aligns poorly. Furthermore, we shall soon see that an efficient algorithm (Smith-Waterman) exists for computing optimal local alignments, and that its running time is asymptotically the same as that of the Needleman-Wunsch algorithm, which is the formal name for the global alignment algorithm.

[1] Note that some authors define subsequence such that characters can also be removed from the middle of a sequence; that definition is not of interest to us in this context. The term substring is generally less ambiguous.
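The definitions above can be checked directly on a small example. The following is a minimal Python sketch, not part of the original notes: the helper names and the +1/-1/-1 scoring scheme (match, mismatch, indel) are assumptions chosen only for illustration. It enumerates prefixes, suffixes, and substrings, verifies the set identity stated earlier, and computes a local alignment score by brute force, straight from the definition (a global alignment for every pair of substrings).

```python
def prefixes(s):
    """All prefixes of s, including the empty string and s itself."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s):
    """All suffixes of s, including the empty string and s itself."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s):
    """All contiguous substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def global_score(a, b, match=1, mismatch=-1, indel=-1):
    """Score of the optimal global alignment of a and b.

    The scoring values are hypothetical; the notes only assume some
    similarity matrix delta.
    """
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


def brute_force_local_score(x, y):
    """Local alignment score by definition: best global score over all
    pairs of substrings. Far too slow for real sequences; it is only a
    reference for understanding the definition."""
    return max(global_score(a, b) for a in substrings(x) for b in substrings(y))


x = "ACGAAT"
suffixes_of_prefixes = {t for p in prefixes(x) for t in suffixes(p)}
prefixes_of_suffixes = {t for q in suffixes(x) for t in prefixes(q)}
assert suffixes_of_prefixes == prefixes_of_suffixes == substrings(x)
print(brute_force_local_score("ACGAAT", "ACTGAG"))
```

Because the empty string is a substring of every string, the brute-force score is never negative under this scheme: aligning the empty string with the empty string scores 0.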

Prefixes: ɛ, A, AC, ACG, ACGA, ACGAA, ACGAAT
Suffixes: ɛ, T, AT, AAT, GAAT, CGAAT, ACGAAT
Substrings: ACGAAT, ACGAA, ACGA, ACG, AC, A, CGAAT, CGAA, CGA, CG, C, GAAT, GAA, GA, G, AAT, AA, AT, T, ɛ

Figure 3.1: Prefixes, Suffixes, and Substrings of ACGAAT

Problem Description

We present here, as in the last lecture, the problem of local alignment. Some of the notation is slightly different, to get you used to different ways of representing the same problem. The same ideas are all present, though.

Given:
  An alphabet, $\Sigma$.
  A similarity (or scoring) matrix, $\delta$.
  Two sequences, $X$ and $Y$, such that
    $X = x_1 x_2 x_3 \ldots x_n = x$
    $Y = y_1 y_2 y_3 \ldots y_m = y$
  where $x_i, y_j \in \Sigma$ for all $i \in [1, n]$ and $j \in [1, m]$. In other words, $x \in \Sigma^n$ and $y \in \Sigma^m$.

Compute: The score of the optimal local alignment. Let $\alpha$ be a subsequence of $X$, and $\beta$ be a subsequence of $Y$. Let $r$ be the score of the global alignment of $\alpha$ with $\beta$. The score of the optimal local alignment between $X$ and $Y$ is the maximum such $r$, taken over all $\alpha \in \mathrm{Subseq}(X)$ and $\beta \in \mathrm{Subseq}(Y)$. Note that $\mathrm{Subseq}(S)$ refers to the set of substrings of the string $S$.

A Description of the Smith-Waterman Algorithm

The Smith-Waterman algorithm can be thought of as carrying out the following steps (from a high level):

  Consider two sequences, $X$ and $Y$. Let $X_i$ be the $i$th prefix of $X$, and let $Y_j$ be the $j$th prefix of $Y$. As an example, $X_2$ of our previous sequence ACGAAT would be AC, while $Y_3$ of a different sequence ACTGAG would be ACT.
  Take a suffix of $X_i$, and take a suffix of $Y_j$.
  Let $V(i, j)$ be the value of the optimal global alignment between these two chosen suffixes. The idea is to consider the global alignments of all pairs of suffixes of $X_i$ and $Y_j$, find the best score attained by any such pair, and store it in $V(i, j)$.

Pseudocode and Key Concepts

Here, as we did in the last lecture, we present the pseudocode for the local alignment algorithm:

    function LocalAlignment(x ∈ Σ^n, y ∈ Σ^m)
        V[0,0] ← 0
        for i ∈ {1, 2, ..., n} do
            V[i,0] ← 0
        for j ∈ {1, 2, ..., m} do
            V[0,j] ← 0
        for i ∈ {1, 2, ..., n} do
            for j ∈ {1, 2, ..., m} do
                V[i,j] ← max { 0,
                               V[i-1,j-1] + δ(x_i, y_j),
                               V[i-1,j] + δ(x_i, -),
                               V[i,j-1] + δ(-, y_j) }
        return max { V[i,j] : i ∈ {0, 1, ..., n}, j ∈ {0, 1, ..., m} }

Note that $V(i, 0) = V(0, j) = 0$ for all $i$ and $j$. Also note that, just as in global alignment, our matrix here has $(N + 1)$ columns and $(M + 1)$ rows, where $N = n$ and $M = m$ are the lengths of the two sequences. You can switch these dimensions, but doing so requires a slight readjustment of the loops in order to fill in the table row-wise rather than column-wise. These finer details do not affect the result.

For $i > 0$ and $j > 0$, we set $V(i, j)$ to the maximum score over four scenarios: beginning a new local alignment (the 0 term), aligning $x_i$ with $y_j$, aligning $x_i$ with a gap, and aligning $y_j$ with a gap. See the last lecture for more on the dynamic programming aspect of this. Finally, the maximum value in the matrix $V$ gives you the score of the optimal local alignment.

There is also the concept of backtracking (or traceback), which is needed to construct the actual optimal local alignment (rather than just its score).
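The pseudocode and the traceback idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name `smith_waterman` and the +1/-1/-1 scoring (match, mismatch, indel) are hypothetical stand-ins for the similarity matrix δ, and the table is stored with one row per prefix of x, so it is indexed as V[i][j].

```python
def smith_waterman(x, y, match=1, mismatch=-1, indel=-1):
    """Return (score, aligned part of x, aligned part of y) for one
    optimal local alignment. Scoring values are illustrative only."""
    n, m = len(x), len(y)

    def delta(a, b):
        return match if a == b else mismatch

    # V[i][0] = V[0][j] = 0, as in the initialization loops.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                0,                                            # start a new alignment
                V[i - 1][j - 1] + delta(x[i - 1], y[j - 1]),  # align x_i with y_j
                V[i - 1][j] + indel,                          # align x_i with a gap
                V[i][j - 1] + indel,                          # align y_j with a gap
            )
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)

    # Traceback: start at a cell holding the maximum, stop at a 0 cell.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and V[i][j] > 0:
        if V[i][j] == V[i - 1][j - 1] + delta(x[i - 1], y[j - 1]):
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j] + indel:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, ax, ay = smith_waterman("ACGAAT", "ACTGAG")
print(score)
print(ax)
print(ay)
```

When several cells share the maximum score, this sketch keeps the first one it encounters; any of them is a valid starting point for the traceback.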
In global alignment, we start constructing our alignment from the edit graph at the bottom-right corner of the matrix, while in local alignment you can start from any cell of the $V$ matrix that contains the matrix's maximal score, and you stop as soon as you reach a cell whose value is 0.

Runtime of Local Alignment

The Smith-Waterman algorithm is an algorithm of order $NM$. Here, a unit of time can be any one of an addition, a subtraction, an assignment to a variable, or another small, constant-time operation of that scale.

Note that our $V$ matrix is $(N + 1)$ by $(M + 1)$, and we do the following calculations in the algorithm:

  The initialization step takes $(N + M + 1)$ units, in order to fill in each of the matrix values where $i = 0$ or $j = 0$.
  The non-initialization loop body executes $N \cdot M$ times.
  On each iteration, we find the max of four numbers. The minimum number of comparisons needed to find the max of four numbers is 3.
  Three of these four numbers need to be computed (the 0 does not involve any computation), which takes about 3 operations each (an addition, plus accessing the appropriate elements of $V$ and $\delta$).

Thus, we have about $3 + 3 \cdot 3 = 12$ units of time for finding one $V(i, j)$ with $i > 0$ and $j > 0$. The approximate total work done is therefore

\[ (N + M + 1) + 12NM, \]

which is dominated by the $NM$ term. This is the approximate time needed to complete our dynamic programming matrix $V$. For those familiar with big-O notation, we would say the algorithm runs in $O(NM)$ time.

3.2 Affine Gap Alignment

Gap Theory

So far, we have used the word gap to mean aligning a character of one string with the - character in the other string. In general, though, gaps can be several dashes in a row. To distinguish between these gap clusters and a single - character, we will refer to a single - as an indel.

In a biological context, it may be more useful to align two sequences such that clusters of gaps are preferred to single indels spread throughout the alignment. One example of this scenario arises when aligning a DNA sequence with its introns removed (perhaps obtained by reverse engineering the DNA sequence from a known protein sequence) against the entire length of DNA on a chromosome, in order to detect where in the chromosome the gene corresponding to the intron-removed sequence might be. In cases like these, you want the alignment to contain fewer gaps (clusters).
This can be done with a simple change to local alignment: we create a scenario where the penalty for $k$ indels in a row (a gap of length $k$) is less than $k$ times the penalty for a single indel. In general, there are different types of gapped alignment algorithms that treat indels and gap clusters differently, but we will mainly focus on affine gap alignment, presented below.

Alternate Notation of Global Alignment

Before we explore the idea of preferring gap clusters to indels, we introduce the notation for the global alignment problem used by Smith and Waterman. This notation does not change the algorithm, but it does make it easier to understand what happens in the new gap alignment algorithm presented in this class. First, we define several variables:

  Let $s(a, b)$ be a similarity score function that gives the similarity between characters $a$ and $b$ of the alphabet.

  Let $a = a_1 a_2 a_3 \ldots a_n$ and $b = b_1 b_2 b_3 \ldots b_m$, with $|a| = n$ and $|b| = m$.
  Let $S$ represent a function applied to prefixes of $a$ and $b$, such that $S(a_1 a_2 \ldots a_i, b_1 b_2 \ldots b_j)$ equals the similarity score of the global alignment of prefix $i$ of $a$ with prefix $j$ of $b$. We abbreviate this value as $S(i, j)$ or $S_{i,j}$.

Then, the following initializations occur:

\[ S(0, 0) = 0, \qquad S(i, 0) = \sum_{l=1}^{i} s(a_l, -), \qquad S(0, j) = \sum_{k=1}^{j} s(-, b_k) \]

And the main portion of the algorithm has the following notation:

\[ S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) \\ S_{i-1,j} + s(a_i, -) \\ S_{i,j-1} + s(-, b_j) \end{cases} \]

Finally, $S(a, b) = S_{n,m}$ is the score of the optimal global alignment of the entire sequences $a$ and $b$. Again, this is all just a different notation for global alignment. We present it because it is concise, and because it makes the gap alignment notation much easier to read. The above may be called the Smith-Waterman notation of alignment.

Affine Gap Overview, Notation, and Recurrence

The affine gap alignment algorithm prefers alignments that have a small number of large gaps. It does so by introducing a penalty for opening a gap cluster, as well as a separate penalty for each additional indel after the opening of a cluster. First, let us define a few variables to add to the Smith-Waterman notation above:

  Let
  \[ H_{i,j} = \max \left\{ 0,\ \max_{\substack{1 \le p \le i \le n \\ 1 \le q \le j \le m}} S(a_p a_{p+1} \cdots a_{i-1} a_i,\ b_q b_{q+1} \cdots b_{j-1} b_j) \right\} \]
  Let $\alpha$ represent the penalty for opening a gap cluster.
  Let $\beta$ represent the penalty for continuing a gap cluster.
  Consider a cluster of $k$ consecutive indels. Let the cost of this length-$k$ cluster be
  \[ g(k) = \alpha + \beta (k - 1) \]
  We will typically see the gap function $g(k)$ with a negative sign in front of it, $-g(k)$, to signify that it is a penalty.

In a similar vein, let

\[ H(a, b) = \max_{\substack{1 \le i \le j \le n \\ 1 \le k \le l \le m}} S(a_i a_{i+1} \cdots a_{j-1} a_j,\ b_k b_{k+1} \cdots b_{l-1} b_l) \]

This represents the maximum similarity score over all substrings of the strings $a$ and $b$.
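To see how the gap function $g(k) = \alpha + \beta(k - 1)$ favors clusters, it helps to plug in numbers. The short Python sketch below uses hypothetical values $\alpha = 5$ and $\beta = 1$ (the notes do not fix any values) and compares one cluster of four indels with four isolated indels.

```python
def affine_gap_penalty(k, alpha, beta):
    """g(k) = alpha + beta * (k - 1): cost of one gap cluster of k indels."""
    if k < 1:
        raise ValueError("a gap cluster contains at least one indel")
    return alpha + beta * (k - 1)


alpha, beta = 5, 1  # hypothetical open and continue penalties

one_cluster = affine_gap_penalty(4, alpha, beta)       # one gap of length 4
four_singles = 4 * affine_gap_penalty(1, alpha, beta)  # four separate indels

print(one_cluster, four_singles)
assert one_cluster < four_singles
```

With these values the single cluster costs 5 + 1 * 3 = 8, while four separate indels cost 4 * 5 = 20, so an alignment that groups its indels together is penalized far less.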

Now, we get into the major recurrences of the affine gap algorithm using the above notation. Consider three matrices, $E$, $F$, and $H$, defined as follows:

\[ E_{i,j} = F_{i,j} = H_{i,j} = 0 \quad \text{for } i \cdot j = 0 \]

\[ E_{i,j} = \max \begin{cases} H_{i,j-1} - \alpha \\ E_{i,j-1} - \beta \end{cases} \]

\[ F_{i,j} = \max \begin{cases} H_{i-1,j} - \alpha \\ F_{i-1,j} - \beta \end{cases} \]

\[ H_{i,j} = \max \begin{cases} 0 \\ E_{i,j} \\ F_{i,j} \\ H_{i-1,j-1} + s(a_i, b_j) \end{cases} \]

Affine Gap Algorithm/Notation Explanation

With some intuition for the notation of the affine gap algorithm, we can now examine what the different parts of the algorithm mean. First, realize that in order for the algorithm to prefer a small number of large gaps, $\alpha$ should be large, to penalize creating a gap, while $\beta$ should be small, so that continuing a gap is not penalized as much.

The matrix $E$ holds the optimal score for the $i$th prefix of $a$ and the $j$th prefix of $b$ in the case where the alignment aligns the $j$th character of $b$ with a - character. $F$ holds the optimal score in the symmetric case, where the $i$th character of $a$ is aligned with a - character. In either case, by including an indel at your current location, you are either opening a new gap cluster (meaning you subtract $\alpha$) or continuing an existing one (meaning you subtract $\beta$). At the same time, the $H$ matrix stores the optimal score for the alignments, taking into account our modified gap penalties that favor gap clusters.

Traceback (or backtracking) is also slightly different here compared to local and global alignment. When you construct your alignment, you do not consider only the $H$ matrix; you may find yourself moving back and forth between all three matrices in order to reconstruct the alignment.
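The three recurrences translate almost line for line into code. The following Python sketch computes only the optimal score (no traceback). The function name and the default parameters (α = 5, β = 1, match +2, mismatch -1) are assumptions made for illustration; the notes leave the scoring function $s$ and the penalties unspecified.

```python
def affine_local_score(a, b, alpha=5, beta=1, match=2, mismatch=-1):
    """Best local alignment score under the affine gap cost
    g(k) = alpha + beta * (k - 1), using the E, F, H matrices."""
    n, m = len(a), len(b)

    def s(p, q):
        return match if p == q else mismatch

    # E, F and H are all 0 wherever i * j == 0, as in the notes.
    E = [[0] * (m + 1) for _ in range(n + 1)]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    H = [[0] * (m + 1) for _ in range(n + 1)]

    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # b_j aligned with '-': open a new cluster or continue one.
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            # a_i aligned with '-': open a new cluster or continue one.
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            H[i][j] = max(
                0,
                E[i][j],
                F[i][j],
                H[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
            )
            best = max(best, H[i][j])
    return best


# Two conserved blocks separated by three extra characters in b.
print(affine_local_score("AAAACCCC", "AAAAGGGCCCC"))
```

In the example call, bridging the two conserved blocks with a single gap cluster of length 3 costs 5 + 1 * 2 = 7, which is less than the 8 points gained by also aligning the second block, so the optimal alignment spans the gap instead of stopping after the first block.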

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 2 Materials used from R. Shamir [2] and H.J. Hoogeboom [4]. 1 Molecular Biology Sequences DNA A, T, C, G RNA A, U, C, G Protein A, R, D, N, C E,

More information

Lecture 10. Sequence alignments

Lecture 10. Sequence alignments Lecture 10 Sequence alignments Alignment algorithms: Overview Given a scoring system, we need to have an algorithm for finding an optimal alignment for a pair of sequences. We want to maximize the score

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University

Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 - Spring 2015 Luay Nakhleh, Rice University DP Algorithms for Pairwise Alignment The number of all possible pairwise alignments (if

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 04: Variations of sequence alignments http://www.pitt.edu/~mcs2/teaching/biocomp/tutorials/global.html Slides adapted from Dr. Shaojie Zhang (University

More information

Notes on Dynamic-Programming Sequence Alignment

Notes on Dynamic-Programming Sequence Alignment Notes on Dynamic-Programming Sequence Alignment Introduction. Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for rigorous alignment of DNA

More information

Computational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh

Computational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh Computational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh Overlap detection: Semi-Global Alignment An overlap of two sequences is considered an

More information

Algorithmic Approaches for Biological Data, Lecture #20

Algorithmic Approaches for Biological Data, Lecture #20 Algorithmic Approaches for Biological Data, Lecture #20 Katherine St. John City University of New York American Museum of Natural History 20 April 2016 Outline Aligning with Gaps and Substitution Matrices

More information

Mouse, Human, Chimpanzee

Mouse, Human, Chimpanzee More Alignments 1 Mouse, Human, Chimpanzee Mouse to Human Chimpanzee to Human 2 Mouse v.s. Human Chromosome X of Mouse to Human 3 Local Alignment Given: two sequences S and T Find: substrings of S and

More information

Programming assignment for the course Sequence Analysis (2006)

Programming assignment for the course Sequence Analysis (2006) Programming assignment for the course Sequence Analysis (2006) Original text by John W. Romein, adapted by Bart van Houte (bart@cs.vu.nl) Introduction Please note: This assignment is only obligatory for

More information

Lecture 9: Core String Edits and Alignments

Lecture 9: Core String Edits and Alignments Biosequence Algorithms, Spring 2005 Lecture 9: Core String Edits and Alignments Pekka Kilpeläinen University of Kuopio Department of Computer Science BSA Lecture 9: String Edits and Alignments p.1/30 III:

More information

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences SEQUENCE ALIGNMENT ALGORITHMS 1 Why compare sequences? Reconstructing long sequences from overlapping sequence fragment Searching databases for related sequences and subsequences Storing, retrieving and

More information

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University 1 Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University DP Algorithms for Pairwise Alignment 2 The number of all possible pairwise alignments (if gaps are allowed)

More information

DNA Alignment With Affine Gap Penalties

DNA Alignment With Affine Gap Penalties DNA Alignment With Affine Gap Penalties Laurel Schuster Why Use Affine Gap Penalties? When aligning two DNA sequences, one goal may be to infer the mutations that made them different. Though it s impossible

More information

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive

More information

Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77

Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77 Dynamic Programming Part I: Examples Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 1 / 77 Dynamic Programming Recall: the Change Problem Other problems: Manhattan

More information

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it. Sequence Alignments Overview Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it. Sequence alignment means arranging

More information

Brief review from last class

Brief review from last class Sequence Alignment Brief review from last class DNA is has direction, we will use only one (5 -> 3 ) and generate the opposite strand as needed. DNA is a 3D object (see lecture 1) but we will model it

More information

Lecture 18: March 23

Lecture 18: March 23 0-725/36-725: Convex Optimization Spring 205 Lecturer: Ryan Tibshirani Lecture 8: March 23 Scribes: James Duyck Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have not

More information

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties From LCS to Alignment: Change the Scoring The Longest Common Subsequence (LCS) problem the simplest form of sequence

More information

Lecturers: Sanjam Garg and Prasad Raghavendra March 20, Midterm 2 Solutions

Lecturers: Sanjam Garg and Prasad Raghavendra March 20, Midterm 2 Solutions U.C. Berkeley CS70 : Algorithms Midterm 2 Solutions Lecturers: Sanjam Garg and Prasad aghavra March 20, 207 Midterm 2 Solutions. (0 points) True/False Clearly put your answers in the answer box in front

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 6: Alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg

More information

Biology 644: Bioinformatics

Biology 644: Bioinformatics Find the best alignment between 2 sequences with lengths n and m, respectively Best alignment is very dependent upon the substitution matrix and gap penalties The Global Alignment Problem tries to find

More information

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations: Lecture Overview Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating

More information

Local Alignment & Gap Penalties CMSC 423

Local Alignment & Gap Penalties CMSC 423 Local Alignment & ap Penalties CMSC 423 lobal, Semi-global, Local Alignments Last time, we saw a dynamic programming algorithm for global alignment: both strings s and t must be completely matched: s t

More information

Gaps ATTACGTACTCCATG ATTACGT CATG. In an edit script we need 4 edit operations for the gap of length 4.

Gaps ATTACGTACTCCATG ATTACGT CATG. In an edit script we need 4 edit operations for the gap of length 4. Gaps ATTACGTACTCCATG ATTACGT CATG In an edit script we need 4 edit operations for the gap of length 4. In maximal score alignments we treat the dash " " like any other character, hence we charge the s(x,

More information

Sequence Alignment & Search

Sequence Alignment & Search Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating the first version

More information

Regular Expression Constrained Sequence Alignment

Regular Expression Constrained Sequence Alignment Regular Expression Constrained Sequence Alignment By Abdullah N. Arslan Department of Computer science University of Vermont Presented by Tomer Heber & Raz Nissim otivation When comparing two proteins,

More information

Sequence Alignment. part 2

Sequence Alignment. part 2 Sequence Alignment part 2 Dynamic programming with more realistic scoring scheme Using the same initial sequences, we ll look at a dynamic programming example with a scoring scheme that selects for matches

More information

From Smith-Waterman to BLAST

From Smith-Waterman to BLAST From Smith-Waterman to BLAST Jeremy Buhler July 23, 2015 Smith-Waterman is the fundamental tool that we use to decide how similar two sequences are. Isn t that all that BLAST does? In principle, it is

More information

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology?

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology? Local Sequence Alignment & Heuristic Local Aligners Lectures 18 Nov 28, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall

More information

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST Alexander Chan 5075504 Biochemistry 218 Final Project An Analysis of Pairwise

More information

Lecture 15: Log Barrier Method

Lecture 15: Log Barrier Method 10-725/36-725: Convex Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 15: Log Barrier Method Scribes: Pradeep Dasigi, Mohammad Gowayyed Note: LaTeX template courtesy of UC Berkeley EECS dept.

More information

Dynamic Programming & Smith-Waterman algorithm

Dynamic Programming & Smith-Waterman algorithm m m Seminar: Classical Papers in Bioinformatics May 3rd, 2010 m m 1 2 3 m m Introduction m Definition is a method of solving problems by breaking them down into simpler steps problem need to contain overlapping

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 06: Multiple Sequence Alignment https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/rplp0_90_clustalw_aln.gif/575px-rplp0_90_clustalw_aln.gif Slides

More information

Recursive-Fib(n) if n=1 or n=2 then return 1 else return Recursive-Fib(n-1)+Recursive-Fib(n-2)

Recursive-Fib(n) if n=1 or n=2 then return 1 else return Recursive-Fib(n-1)+Recursive-Fib(n-2) Dynamic Programming Any recursive formula can be directly translated into recursive algorithms. However, sometimes the compiler will not implement the recursive algorithm very efficiently. When this is

More information

DynamicProgramming. September 17, 2018

DynamicProgramming. September 17, 2018 DynamicProgramming September 17, 2018 1 Lecture 11: Dynamic Programming CBIO (CSCI) 4835/6835: Introduction to Computational Biology 1.1 Overview and Objectives We ve so far discussed sequence alignment

More information

Lecture 10: Local Alignments

Lecture 10: Local Alignments Lecture 10: Local Alignments Study Chapter 6.8-6.10 1 Outline Edit Distances Longest Common Subsequence Global Sequence Alignment Scoring Matrices Local Sequence Alignment Alignment with Affine Gap Penalties

More information

CSE 417 Dynamic Programming (pt 5) Multiple Inputs

CSE 417 Dynamic Programming (pt 5) Multiple Inputs CSE 417 Dynamic Programming (pt 5) Multiple Inputs Reminders > HW5 due Wednesday Dynamic Programming Review > Apply the steps... optimal substructure: (small) set of solutions, constructed from solutions

More information

Memoization/Dynamic Programming. The String reconstruction problem. CS124 Lecture 11 Spring 2018

Memoization/Dynamic Programming. The String reconstruction problem. CS124 Lecture 11 Spring 2018 CS124 Lecture 11 Spring 2018 Memoization/Dynamic Programming Today s lecture discusses memoization, which is a method for speeding up algorithms based on recursion, by using additional memory to remember

More information

Central Issues in Biological Sequence Comparison

Central Issues in Biological Sequence Comparison Central Issues in Biological Sequence Comparison Definitions: What is one trying to find or optimize? Algorithms: Can one find the proposed object optimally or in reasonable time optimize? Statistics:

More information

EECS 4425: Introductory Computational Bioinformatics Fall Suprakash Datta

EECS 4425: Introductory Computational Bioinformatics Fall Suprakash Datta EECS 4425: Introductory Computational Bioinformatics Fall 2018 Suprakash Datta datta [at] cse.yorku.ca Office: CSEB 3043 Phone: 416-736-2100 ext 77875 Course page: http://www.cse.yorku.ca/course/4425 Many

More information

5.1 The String reconstruction problem

5.1 The String reconstruction problem CS125 Lecture 5 Fall 2014 5.1 The String reconstruction problem The greedy approach doesn t always work, as we have seen. It lacks flexibility; if at some point, it makes a wrong choice, it becomes stuck.

More information

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment Today s Lecture Edit graph & alignment algorithms Smith-Waterman algorithm Needleman-Wunsch algorithm Local vs global Computational complexity of pairwise alignment Multiple sequence alignment 1 Sequence

More information

Dynamic Programming: Sequence alignment. CS 466 Saurabh Sinha

Dynamic Programming: Sequence alignment. CS 466 Saurabh Sinha Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha DNA Sequence Comparison: First Success Story Finding sequence similarities with genes of known function is a common approach to infer a newly

More information

Sequence Alignment. COMPSCI 260 Spring 2016

Sequence Alignment. COMPSCI 260 Spring 2016 Sequence Alignment COMPSCI 260 Spring 2016 Why do we want to compare DNA or protein sequences? Find genes similar to known genes IdenGfy important (funcgonal) sequences by finding conserved regions As

More information

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD
