Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

Similar documents
Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles

Sequence alignment algorithms

Algorithmic Approaches for Biological Data, Lecture #20

Lecture 10. Sequence alignments

Brief review from last class

Sequence Comparison: Dynamic Programming. Genome 373 Genomic Informatics Elhanan Borenstein

Sequence analysis Pairwise sequence alignment

Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment

Biology 644: Bioinformatics

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University

Outline. Sequence Alignment. Types of Sequence Alignment. Genomics & Computational Biology. Section 2. How Computers Store Information

Computational Genomics and Molecular Biology, Fall

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

BLAST MCDB 187. Friday, February 8, 13

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Sequence Alignment & Search

EECS730: Introduction to Bioinformatics

Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

Bioinformatics explained: Smith-Waterman

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Sequence Alignment. part 2

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Dynamic Programming & Smith-Waterman algorithm

CMSC423: Bioinformatic Algorithms, Databases and Tools Lecture 8. Note

Pairwise alignment II

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

Multiple sequence alignment. November 20, 2018

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology?

Sequence comparison: Local alignment

Alignment of Long Sequences

Alignment ABC. Most slides are modified from Serafim s lectures

Chapter 6. Multiple sequence alignment (week 10)

Pairwise Sequence alignment Basic Algorithms

Pairwise Sequence Alignment. Zhongming Zhao, PhD

Bioinformatics for Biologists

Computational Molecular Biology

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD

B L A S T! BLAST: Basic local alignment search tool. Copyright notice. February 6, Pairwise alignment: key points. Outline of tonight s lecture

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

Software Implementation of Smith-Waterman Algorithm in FPGA

Special course in Computer Science: Advanced Text Algorithms

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Lecture 3: February Local Alignment: The Smith-Waterman Algorithm

Multiple Sequence Alignment: Multidimensional. Biological Motivation

Programming assignment for the course Sequence Analysis (2006)

Computational Molecular Biology

Mouse, Human, Chimpanzee

Dynamic Programming: Sequence alignment. CS 466 Saurabh Sinha

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

Notes on Dynamic-Programming Sequence Alignment

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)

Research Article International Journals of Advanced Research in Computer Science and Software Engineering ISSN: X (Volume-7, Issue-6)

Distributed Protein Sequence Alignment

From Smith-Waterman to BLAST

Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion.

Central Issues in Biological Sequence Comparison

Computational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh

Divya R. Singh. Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques. February A Thesis Presented by

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching,

EECS730: Introduction to Bioinformatics

BGGN 213 Foundations of Bioinformatics Barry Grant

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences

Biological Sequence Matching Using Fuzzy Logic

Sequence Alignment. Ulf Leser

Alignment of Pairs of Sequences

A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm

Research Article Aligning Sequences by Minimum Description Length

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo

Similarity searches in biological sequence databases

Multiple Sequence Alignment (MSA)

Long Read RNA-seq Mapper

On the Efficacy of Haskell for High Performance Computational Biology

Multiple Sequence Alignment. Mark Whitsitt - NCSA

BLAST & Genome assembly

Research on Pairwise Sequence Alignment Needleman-Wunsch Algorithm

Basics of Multiple Sequence Alignment

A Design of a Hybrid System for DNA Sequence Alignment

CS 284A: Algorithms for Computational Biology Notes on Lecture: BLAST. The statistics of alignment scores.

Sequence Alignment. COMPSCI 260 Spring 2016

B I O I N F O R M A T I C S

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA

A Study On Pair-Wise Local Alignment Of Protein Sequence For Identifying The Structural Similarity

LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA

Fast Sequence Alignment Method Using CUDA-enabled GPU

Using Blocks in Pairwise Sequence Alignment

Stephen Scott.

Scoring and heuristic methods for sequence alignment CG 17

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Dynamic Programming Course: A structure based flexible search method for motifs in RNA. By: Veksler, I., Ziv-Ukelson, M., Barash, D.

ABSTRACT MST BASED AB INITIO ASSEMBLER OF EXPRESSED SEQUENCE TAGS. By Yuan Zhang

Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--

Transcription:

Today s Lecture Edit graph & alignment algorithms Smith-Waterman algorithm Needleman-Wunsch algorithm Local vs global Computational complexity of pairwise alignment Multiple sequence alignment 1

Sequence alignments correspond to paths in a D! 2

The Edit raph for a Pair of Sequences C T T T C C C C T C 3

The edit graph is a D. Except on the boundaries, the nodes have in-degree and out-degree both 3. The depth structure is as shown on the next slide. Child of node of depth n always has depth n + 1 (for a horizontal or vertical edge), or depth n + 2 (for a diagonal edge). 4

Depth Structure C T T T C C C C T C 5

Paths in edit graph correspond to alignments of subsequences each edge on path corresponds to alignment column. diagonal edges correspond to column of two aligned residues; horizontal edges correspond to column with residue in 1 st (top, horizontal) sequence gap in the 2 d (vertical) sequence vertical edges correspond to column with residue in 2 d sequence gap in 1 st sequence 6

C T C C T T T C C C bove path corresponds to following alignment (w/ lower case letters considered unaligned): actttccca gct-c- 7

Weights on Edit raphs Edge weights correspond to scores on alignment columns. Highest weight path corresponds to highest-scoring alignment for that scoring system. Weights may be assigned using a substitution score matrix, and assigns a score to each possible pair of residues occurring as alignment column a gap penalty assigns a score to column consisting of residue opposite a gap. Example for protein sequences: BLOSUM62 8

BLOSUM62 Score Matrix P -12-2 R N D C Q E H I L K M F P S T W Y V B Z X * 4-1 -2-2 0-1 -1 0-2 -1-1 -1-1 -2-1 1 0-3 -2 0-2 -1 0-4 R -1 5 0-2 -3 1 0-2 0-3 -2 2-1 -3-2 -1-1 -3-2 -3-1 0-1 -4 N -2 0 6 1-3 0 0 0 1-3 -3 0-2 -3-2 1 0-4 -2-3 3 0-1 -4 D -2-2 1 6-3 0 2-1 -1-3 -4-1 -3-3 -1 0-1 -4-3 -3 4 1-1 -4 C 0-3 -3-3 9-3 -4-3 -3-1 -1-3 -1-2 -3-1 -1-2 -2-1 -3-3 -2-4 Q -1 1 0 0-3 5 2-2 0-3 -2 1 0-3 -1 0-1 -2-1 -2 0 3-1 -4 E -1 0 0 2-4 2 5-2 0-3 -3 1-2 -3-1 0-1 -3-2 -2 1 4-1 -4 0-2 0-1 -3-2 -2 6-2 -4-4 -2-3 -3-2 0-2 -2-3 -3-1 -2-1 -4 H -2 0 1-1 -3 0 0-2 8-3 -3-1 -2-1 -2-1 -2-2 2-3 0 0-1 -4 I -1-3 -3-3 -1-3 -3-4 -3 4 2-3 1 0-3 -2-1 -3-1 3-3 -3-1 -4 L -1-2 -3-4 -1-2 -3-4 -3 2 4-2 2 0-3 -2-1 -2-1 1-4 -3-1 -4 K -1 2 0-1 -3 1 1-2 -1-3 -2 5-1 -3-1 0-1 -3-2 -2 0 1-1 -4 M -1-1 -2-3 -1 0-2 -3-2 1 2-1 5 0-2 -1-1 -1-1 1-3 -1-1 -4 F -2-3 -3-3 -2-3 -3-3 -1 0 0-3 0 6-4 -2-2 1 3-1 -3-3 -1-4 P -1-2 -2-1 -3-1 -1-2 -2-3 -3-1 -2-4 7-1 -1-4 -3-2 -2-1 -2-4 S 1-1 1 0-1 0 0 0-1 -2-2 0-1 -2-1 4 1-3 -2-2 0 0 0-4 T 0-1 0-1 -1-1 -1-2 -2-1 -1-1 -1-2 -1 1 5-2 -2 0-1 -1 0-4 W -3-3 -4-4 -2-2 -3-2 -2-3 -2-3 -1 1-4 -3-2 11 2-3 -4-3 -2-4 Y -2-2 -2-3 -2-1 -2-3 2-1 -1-2 -1 3-3 -2-2 2 7-1 -3-2 -1-4 V 0-3 -3-3 -1-2 -2-3 -3 3 1-2 1-1 -2-2 0-3 -1 4-3 -2-1 -4 B -2-1 3 4-3 0 1-1 0-3 -4 0-3 -3-2 0-1 -4-3 -3 4 1-1 -4 Z -1 0 0 1-3 3 4-2 0-3 -3 1-1 -3-1 0-1 -3-2 -2 1 4-1 -4 X 0-1 -1-1 -2-1 -1-1 -1-1 -1-1 -1-1 -2 0 0-2 -1-1 -1-1 -1-4 * -4-4 -4-4 -4-4 -4-4 -4-4 -4-4 -4-4 -4-4 -4-4 -4-4 -4-4 -4 1 9

C T C C T T T C C C bove path corresponds to following alignment (w/ lower case letters considered unaligned): actttccca gct-c- 10

lignment algorithms Smith-Waterman algorithm to find highest scoring alignment = dynamic programming algorithm to find highestweight path Is a local alignment algorithm: finds alignment of subsequences rather than the full sequences. Can process nodes in any order in which parents precede children. Commonly used alternatives are depth order row order column order 11

If constrain path to start at upper-left corner node and extend to lower-right corner node, get a global alignment instead This sometimes called Needleman-Wunsch algorithm (altho original N-W alg treated gaps differently) variants which constrain path to start on the left or top boundary, extend to the right or bottom boundary. 12

Local vs. lobal lignments: Biological Considerations Many proteins consist of multiple domains (modules), some of which may be present with similar, but not identical sequence in many other proteins e.g. TP binding domains, DN binding domains, protein-protein interaction domains... Need local alignment to detect presence of similar regions in otherwise dissimilar proteins. Other proteins consist of single domain evolving as a unit e.g. many enzymes, globins. lobal alignment sometimes best in such cases... but even here, some regions are more highly conserved (more slowly evolving) than others, and most sensitive similarity detection may be local alignment. 13

C2-like domain domain found in other prenyltransferases disordered regions Leucine-rich repeat domain β subunit 3-D structures of rat Rab eranylgeranyl Transferase complexed with REP-1, + paralogs. adapted from Rasteiro and Pereira-Leal BMC Evolutionary Biology 2007 7:140 14

Multidomain architecture of representative members from all subfamilies of the mammalian RS protein superfamily. from www.unc.edu/~dsiderov/page2.htm 15

Similar considerations apply to aligning DN sequences: (semi-)global alignment may be preferred for aligning cdn to genome recently diverged genomic sequences (e.g. human / chimp) but local alignment often gives same result! between more highly diverged sequences, have rearrangements (or large indels) in one sequence vs the other, variable distribution of sequence conservation, & these usually make local alignments preferable. 16

Complexity For two sequences of lengths M and N, edit graph has (M+1)(N+1) nodes, 3MN+M+N edges, time complexity: O(MN) space complexity to find highest score and beginning & end of alignment is O(min(M,N)) (since only need store node s values until children processed) space complexity to reconstruct highest-scoring alignment: O(MN) 17

For genomic comparisons may have M, N 10 6 (if comparing two large genomic segments), or M 10 3, N 10 9 (if searching gene sequence against entire genome); in either case MN 10 12. Time complexity 10 12 is (marginally) acceptable. speedups which reduce constant by reducing calculations per matrix cell, using fact that score often 0 (our program swat). still guaranteed to find highest-scoring alignment. reducing # cells considered, using nucleating word matches (BLST, or cross_match). Lose guarantee to find highest-scoring alignment. 18

The Edit raph for a Pair of Sequences C T T T C C C C T C 19

Multiple lignment via Dynamic Programming Higher dimension edit graph each dimension corresponds to a sequence; co-ordinates labelled by residues Each edge corresponds to aligned column of residues (with gaps). Can put arbitrary weights on edges; in particular, can make these correspond to probabilities under an evolutionary model (Sankoff 1975). implicitly assumes independence of columns Highest weight path through graph again gives optimal alignment 20

eneralization to Higher Dimension Each cell in 3-dimensional case looks like this: M V Each edge projects onto a gap or residue in each dimension, defining an alignment column; e.g. red edge defines V M 21