Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77

Similar documents
EECS 4425: Introductory Computational Bioinformatics Fall Suprakash Datta

Dynamic Programming: Sequence alignment. CS 466 Saurabh Sinha

Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene s function

1. Basic idea: Use smaller instances of the problem to find the solution of larger instances

Dynamic Programming Comp 122, Fall 2004

9/29/09 Comp /Comp Fall

Dynamic Programming: Edit Distance

Pairwise Sequence alignment Basic Algorithms

DNA Alignment With Affine Gap Penalties

EECS730: Introduction to Bioinformatics

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD

3/5/2018 Lecture15. Comparing Sequences. By Miguel Andrade at English Wikipedia.

Computational Molecular Biology

Sequence Alignment. part 2

DynamicProgramming. September 17, 2018

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures

Notes on Dynamic-Programming Sequence Alignment

Biology 644: Bioinformatics

Lecture 10. Sequence alignments

Mouse, Human, Chimpanzee

Lecture 3: February Local Alignment: The Smith-Waterman Algorithm

FastA & the chaining problem

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:

CMSC423: Bioinformatic Algorithms, Databases and Tools Lecture 8. Note

EECS730: Introduction to Bioinformatics

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Lecture 10: Local Alignments

Divide and Conquer Algorithms. Problem Set #3 is graded Problem Set #4 due on Thursday

Local Alignment & Gap Penalties CMSC 423

Special course in Computer Science: Advanced Text Algorithms

Divide and Conquer. Bioinformatics: Issues and Algorithms. CSE Fall 2007 Lecture 12

Lecture 12: Divide and Conquer Algorithms

Algorithmic Approaches for Biological Data, Lecture #20

Divide & Conquer Algorithms

Divide & Conquer Algorithms

Presentation for use with the textbook, Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, Dynamic Programming

A Design of a Hybrid System for DNA Sequence Alignment

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology?

Dynamic Programming & Smith-Waterman algorithm

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties

Outline. Sequence Alignment. Types of Sequence Alignment. Genomics & Computational Biology. Section 2. How Computers Store Information

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

Sequence analysis Pairwise sequence alignment

Sequence alignment algorithms

Shortest Path Algorithm

Alignment of Long Sequences

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

Combinatorial Pattern Matching. CS 466 Saurabh Sinha

Computational Genomics and Molecular Biology, Fall

CSE 417 Dynamic Programming (pt 5) Multiple Inputs

Computational Molecular Biology

6.1 The Power of DNA Sequence Comparison

Dynamic Programming. Lecture Overview Introduction

From Smith-Waterman to BLAST

FINDING APPROXIMATE REPEATS WITH MULTIPLE SPACED SEEDS

Longest Common Subsequence

5.1 The String reconstruction problem

Alignment of Pairs of Sequences

Bioinformatics explained: Smith-Waterman

Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University

Sequence Alignment 1

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University

Bioinformatics for Biologists

CMPS 102 Solutions to Homework 7

BGGN 213 Foundations of Bioinformatics Barry Grant

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

Sequence Comparison: Dynamic Programming. Genome 373 Genomic Informatics Elhanan Borenstein

BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment

BLAST & Genome assembly

Divide & Conquer Algorithms

Divya R. Singh. Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques. February A Thesis Presented by

Dynamic Programming II

Sept. 9, An Introduction to Bioinformatics. Special Topics BSC5936:

Lecture 9: Core String Edits and Alignments

Gene regulation. DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate

Memoization/Dynamic Programming. The String reconstruction problem. CS124 Lecture 11 Spring 2018

In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace.

The Dot Matrix Method

Database Searching Using BLAST

Pairwise alignment II

Reminder: sequence alignment in sub-quadratic time

Advanced Algorithms Class Notes for Monday, November 10, 2014

COMP 251 Winter 2017 Online quizzes with answers

Algorithm Design and Analysis

Regular Expression Constrained Sequence Alignment

Lectures 12 and 13 Dynamic programming: weighted interval scheduling

Least Squares; Sequence Alignment

Write an algorithm to find the maximum value that can be obtained by an appropriate placement of parentheses in the expression

CSED233: Data Structures (2017F) Lecture12: Strings and Dynamic Programming

Lecture 9 Graph Traversal

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Rochester Institute of Technology. Making personalized education scalable using Sequence Alignment Algorithm

Acceleration of the Smith-Waterman algorithm for DNA sequence alignment using an FPGA platform

1 Dynamic Programming

Pairwise Sequence Alignment. Zhongming Zhao, PhD

A Study On Pair-Wise Local Alignment Of Protein Sequence For Identifying The Structural Similarity

CHAPTER-6 WEB USAGE MINING USING CLUSTERING

Transcription:

Dynamic Programming Part I: Examples Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 1 / 77

Dynamic Programming Recall: the Change Problem Other problems: Manhattan Tourist Problem, LCS Problem Finally: Sequence alignments Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 2 / 77

Manhattan Tourist Problem (MTP) Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 3 / 77

Manhattan Tourist Problem (MTP) Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 4 / 77

Manhattan Tourist Problem: Formulation Goal: Find the longest path in a weighted grid. Input: A weighted grid G with two distinct vertices, one labeled source" and the other labeled sink" Output: A longest path in G from source" to sink" Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 5 / 77

MTP: An example Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 6 / 77

MTP: Greedy Algorithm Is Not Optimal Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 7 / 77

MTP: Simple Recursive Program 1 MT(n,m) 2 if n = 0 or m = 0 3 return MT(n,m) 4 x MT(n-1,m)+ length of the edge from (n 1, m) to (n, m) 5 y MT(n,m-1)+length of the edge from (n, m 1) to (n, m) 6 return max{x,y} Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 8 / 77

MTP: Simple Recursive Program 1 MT(n,m) 2 if n = 0 or m = 0 3 return MT(n,m) 4 x MT(n-1,m)+ length of the edge from (n 1, m) to (n, m) 5 y MT(n,m-1)+length of the edge from (n, m 1) to (n, m) 6 return max{x,y} What s wrong with this approach? Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 8 / 77

MTP: Dynamic Programming Calculate optimal path score for each vertex in the graph Each vertex s score is the maximum of the prior vertices score plus the weight of the respective edge in between Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 9 / 77

MTP: Dynamic Programming Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 10 / 77

MTP: Dynamic Programming Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 11 / 77

MTP: Dynamic Programming Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 12 / 77

MTP: Dynamic Programming Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 13 / 77

MTP: Dynamic Programming Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 14 / 77

MTP: Recurrence Computing the score for a point (i, j) by the recurrence relation: { si 1,j + weight of the edge between (i 1, j)and (i, j) s i,j = max s i,j 1 + weight of the edge between (i, j 1)and (i, j) The running time is n m for a n by m grid (n = # of rows, m = # of columns) Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 15 / 77

Manhattan is not a perfect Grid Represented as a DAG: Directed Acyclic Graph The score at point B is given by: s i,j = max { si 1,j + weight of the edge between(i 1, j)and(i, j) s i,j 1 + weight of the edge between(i, j 1)and(i, j) Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 16 / 77

Manhattan Is Not A Perfect Grid Computing the score for point x is given by the recurrence relation: { sy + weight of vertex(y, x)where s x = max y Predecessors(x) Predecessors(x): set of vertices that have edges leading to x. The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E) since each edge is evaluated once. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 17 / 77

Longest Path in DAG Problem Goal: Find a longest path between two vertices in a weighted DAG Input: A weighted DAG G with source and sink vertices Output: A longest path in G from source to sink Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 18 / 77

Longest Path in DAG: Dynamic Programming Suppose vertex v has indegree 3 and predecessors {u 1, u 2, u 3 } Longest path to v from source is: s u1 + weight of edge from u 1 to v s v = max s u2 + weight of edge from u 2 to v s u3 + weight of edge from u 3 to v In general: s v = max u {s u + weight of edge from u to v} Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 19 / 77

Traveling in the Grid The only hitch is that one must decide on the order in which visit the vertices By the time the vertex x is analyzed, the values s y for all its predecessors y should be computed - otherwise we are in trouble We need to traverse the vertices in some order Try to find such order for a directed cycle??? Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 20 / 77

Topological ordering 2 different topological orderings of the DAG Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 21 / 77

Traversing the Manhattan Grid 3 different strategies: a) Column by column b) Row by row c) Along diagonals ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 22 / 77

Pseudo-code MTP: Dynamic Programming 1 ManhattanTourist(w, w,w i,j,n,m) 2 for i 1 to n 3 s i,0 s i 1,0 + w i,0 4 for j 1 to m 5 s 0,j s 0,j 1 + w 0,j 6 for i 1 to n 7 for j 1 to m 8 s i 1,j + w i,j s i,j = max s i,j 1 + w i,j s i 1,j 1 + w i,j 9 return s n,m Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 23 / 77

Dynamic Programming Part II: Edit Distance & Alignments Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 24 / 77

Aligning DNA Sequences Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 25 / 77

LCS Alignment without mismatches LCS: Longest Common Subsequence Given two sequences v = v 1 v 2...v m and w = w 1 w 2...w n The LCS of v and w is a sequence of positions in v : 1 < i 1 < i 2 <... < i t < m and a sequence of positions in w : 1 < j 1 < j 2 <... < j t < n such that i t -th letter of v equals to j t -th letter of w and t is maximal Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 26 / 77

LCS: Example Every common subsequence is a path in 2-D grid Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 27 / 77

LCS Problem as Manhattan Tourist Problem Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 28 / 77

Edit Graph for LCS Problem Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 29 / 77

Edit Graph for LCS Problem Every path is a common subsequence. Every diagonal edge adds an extra element to common subsequence. LCS Problem: Find a path with maximum number of diagonal edges. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 30 / 77

Computing LCS Let v i = prefix of v of length i: v 1...v i and w j = prefix of w of length j: w 1...w j Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 31 / 77

Every Path in the Grid Corresponds to an Alignment Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 32 / 77

Aligning Sequences without Insertions and Deletions: Hamming Distance Given two DNA sequences v and w : The Hamming distance: d H (v, w) = 8 is large but the sequences are very similar Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 33 / 77

Aligning Sequences with Insertions and Deletions By shifting one sequence over one position: The edit distance: d L (v, w) = 2 Hamming distance neglects insertions and deletions in DNA Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 34 / 77

Levenshtein or edit distance Definition The Levenshtein distance or edit distance d L between two sequences X and Y is is the minimum number of edit operations of type Replacement, Insertion, or Deletion, that one needs to transform sequence X into sequence Y : d L (X, Y ) = min{r(x, Y ) + I (X, Y ) + D(X, Y )} Using M for match, an edit transcript is a string over the alphabet I, D, R, M that describes a transformation of X to Y. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 35 / 77

Levenshtein or edit distance X = YESTERDAY Example: Given two strings. Y = EASTERS Here is a minimum edit transcript for the above example: edit transcript= D M I M M M M R D D X= Y E S T E R D A Y Y= E A S T E R S The edit distance d L (X, Y ) of X, Y is 5. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 36 / 77

Edit Distance vs Hamming Distance Hamming distance always compares i th letter of v with i th letter of w Hamming distance: d(v, w) = 8 Computing Hamming distance is a trivial task. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 37 / 77

Edit Distance vs Hamming Distance Hamming distance always compares i th letter of v with i th letter of w Edit distance may compare i th letter of v with j th letter of w Hamming distance: d(v, w) = 8 Computing Hamming distance is a trivial task. Edit distance: d(v, w) = 2 Computing edit distance is a non-trivial task. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 37 / 77

Edit Distance: Example ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 38 / 77

Edit Distance: Example What is the edit distance? 5? Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 38 / 77

Edit Distance: Example ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 39 / 77

Edit Distance: Example Can it be done in 3 steps? Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 39 / 77

The Alignment Grid Every alignment path is from source to sink ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 40 / 77

Alignment as a path in the Edit Graph ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 41 / 77

Alignments in Edit Graph ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 42 / 77

Alignments in Edit Graph Every path in the edit graph corresponds to alignment: ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 43 / 77

Alignments in Edit Graph ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 44 / 77

Alignments in Edit Graph ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 45 / 77

Alignment: Dynamic Programming s i 1,j 1 + 1 s i,j = max s i 1,j s i,j 1 if v i = w j Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 46 / 77

Dynamic Programming Example Initialize 1st row and 1st column to be all zeroes Or, to be more precise, initialize 0 th row and 0 th column to be all zeroes ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 47 / 77

Dynamic Programming Example s i 1,j 1 : value from NW +1 if v s i,j = max i = w j s i 1,j : value from N s i,j 1 : value from W ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 48 / 77

Alignment: Backtracking Arrows indicate where the score originated from: Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 49 / 77

Backtracking Example Find a match in row and column 2 i=2, j=2, 5 is a match (T). j=2, i=4, 5, 7 is a match (T). Since v i = w j, s i,j = s i 1,j 1 + 1 s 2,2 = [s 1,1 = 1] + 1 s 2,5 = [s 1,4 = 1] + 1 s 4,2 = [s 3,1 = 1] + 1 s 5,2 = [s 4,1 = 1] + 1 s 7,2 = [s 6,1 = 1] + 1 ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 50 / 77

Dynamic Programming Example Continuing with the dynamic programming algorithm gives this result. ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 51 / 77

Alignment: Dynamic Programming s i 1,j 1 + 1 s i,j = max s i 1,j s i,j 1 if v i = w j Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 52 / 77

Alignment: Dynamic Programming s i 1,j 1 + 1 s i,j = max s i 1,j s i,j 1 if v i = w j This recurrence corresponds to the Manhattan Tourist problem (three incoming edges into a vertex) with all horizontal and vertical edges weighted by zero. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 52 / 77

LCS Algorithm 1 LCS(v,w) 2 for i 1 to n 3 s i,0 0 4 for j 1 to m 5 s 0,j 0 6 for i 1 to n 7 for j 1 to m 8 9 10 return s i 1,j 1 + 1 s i,j = max s i 1,j s i,j 1 b i,j = if v i = w j if s i,j = s i 1,j if s i,j = s i,j 1 if s i,j = s i 1,j 1 Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 53 / 77

Now What? LCS(v,w) created the alignment grid Now we need a way to read the best alignment of v and w Follow the arrows backwards from sink ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 54 / 77

Printing LCS: Backtracking 1 PrintLCS(b,v,i,j) 2 if i = 0 or j = 0 3 return 4 if b i,j = 5 PrintLCS(b,v,i-1,j-1) 6 print v i 7 else 8 if b i,j = 9 PrintLCS(b,v,i-1,j) 10 else 11 PrintLCS(b,v,i,j-1) Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 55 / 77

Now What? Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 56 / 77

Now What? Alignment: A T C G - T A C - A T - G T T A - T ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 56 / 77

LCS Runtime It takes O(nm) time to fill in the n m dynamic programming matrix. Why O(nm)? The pseudocode consists of a nested for" loop inside of another for" loop to set up a n m matrix. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 57 / 77

Dynamic Programming Part II: Sequence Alignment Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 58 / 77

Outline Dot plots Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 59 / 77

Types of sequence alignments Dot plots Number of sequences pairwise: compares two sequences multiple: compares several sequences Portion of aligned sequence global: aligns the sequences over all their length local: finds subsequences with the best similarity scores Algorithms Optimal methods: Needleman-Wunsch, Smith-Waterman Heuristics: FASTA, BLAST Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 60 / 77

Dot plot the simplest way to visualize the similarity between two protein/dna sequences is to use a similarity matrix Identification of insertions/deletions Identification of direct repeats or inversions Steps to create a dot plot Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 61 / 77

Dot plot the simplest way to visualize the similarity between two protein/dna sequences is to use a similarity matrix Identification of insertions/deletions Identification of direct repeats or inversions Steps to create a dot plot 2D matrix One sequence on the top One sequence on the left For each matrix cell, compare the symbols and draw a point if there is a coincidence Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 61 / 77

Dot matrix sequence comparison A dot matrix analysis is primarily a method for comparing two sequences. An (n m) matrix relating two sequences of length n and m respectively is produced by placing a dot at each cell for which the corresponding symbols match. Here is an example for the two sequences: IMISSMISSISSIPPI and MYMISSISAHIPPIE: ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 62 / 77

Dot matrix sequence comparison Definition Let S = s 1 s 2... s n and T = t 1... t m be two strings of length n and m respectively. Let M be an n m matrix. Then M is a (simple) dot plot if for i, j, 1 i n, 1 j m : M[i, j] = { 1 for si = t j 0 else. Note: The longest common substring within the two strings S and T is then the longest matrix subdiagonal containing only 1s. However, rather than drawing the letter 1 we draw a dot. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 63 / 77

Dot matrix sequence comparison Some of the properties of a dot plot are the visualization is easy to understand it is easy to find common substrings, they appear as contiguous dots along a diagonal it is easy to find reversed substrings it is easy to discover displacements it is easy to find repeats Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 64 / 77

Dot plots Sequence length: n and m O(nm) DNA Protein Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 65 / 77

Noise in Dot plot LDL human receptor compared to himself The low density lipoprotein (LDL) receptor is a cell surface protein that plays a central role in the metabolism of cholesterol in humans and animals. Mutations affecting its structure and function give rise to one of the most prevalent human genetic diseases, familial hypercholesterolemia. ioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 66 / 77

Reducing the noise To reduce the noise, a window size w and a stringency s are used and a dot is only drawn at point (x, y) if in the next w positions at least s characters are equal. Example: Phage P22 w = 1, s = 1 w = 11, s = 7 w = 23, s = 15 Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 67 / 77

Random similarity in Dot plots When comparing DNA, there is a 1 4 probability of random matches When comparing protein sequences there is a 1 20 probability of random matches Hence, if coding DNA regions are analized: translate first, then align! You can always go back to DNA after alignment Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 68 / 77

w=1 Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 69 / 77

w=3 Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 70 / 77

w=3, stringency 2 Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 71 / 77

DNA sequence Simple dot plot, w = 1 w = 23, stringency = 16 Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 72 / 77

Protein sequence Simple dot plot, w = 1 w = 23, stringency = 6 Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 73 / 77

Insertion Deletion & Inversion Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 74 / 77

Repeats ABCDEFGEFGHIJKLMNO Tandem duplication Compared to non duplicated Tandem duplication Compared to itself Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 75 / 77

Palindromic repeat (intra chain) 5 GGCGG 3 Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 76 / 77

Limitations of dot plots No score to quantify identical or similar strings Runtime is quadratic; more efficient algorithms to identify identical substrings exist (eg. based on suffix trees) Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 77 / 77