EECS730: Introduction to Bioinformatics


Lecture 06: Multiple Sequence Alignment
https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/rplp0_90_clustalw_aln.gif/575px-rplp0_90_clustalw_aln.gif
Slides adapted from Dr. Shaojie Zhang (University of Central Florida)

Multiple alignments
- Reveal evolutionary history (speciation-related mutations)
- Prediction of protein structure and protein function
- Determine consensus sequence for sequence assembly
- Generalization of the pairwise alignment algorithm

Multiple alignments https://s9.postimg.org/z7797xw2n/ng_2757_f3.jpg

Objective function
- The goal is to maximize the conservation of the alignment columns.
- The more conserved the columns, the better the alignment.
- Three scoring functions characterize the conservation of the columns:
  - Multiple Longest Common Subsequence
  - Entropy
  - Sum-of-pair scores

Multiple Longest Common Subsequence
- A column is a match if all the letters in the column are the same.

  AAA
  AAA
  AAT
  ATC

- Similar idea to the LCS problem formulation for pairwise alignment
- Only good for very similar sequences

Entropy
- Define the frequency of each letter in each column of the multiple alignment (the gap may be included in the alphabet). For the alignment

  AAA
  AAA
  AAT
  ATC

  1st column: $p_A = 1$, $p_T = p_G = p_C = 0$
  2nd column: $p_A = 0.75$, $p_T = 0.25$, $p_G = p_C = 0$
  3rd column: $p_A = 0.50$, $p_T = 0.25$, $p_C = 0.25$, $p_G = 0$

- Compute the entropy of each column:
  $$-\sum_{X \in \{A, T, G, C\}} p_X \log_2 p_X$$

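Entropy cont.
- Best case: a column of A, A, A, A has entropy 0.
- Worst case: a column of A, T, G, C has entropy
  $$-\sum \frac{1}{4} \log_2 \frac{1}{4} = -4 \left( \frac{1}{4} \cdot (-2) \right) = 2$$
- The entropy of a multiple alignment is the sum of the entropies of its columns.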
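The entropy score can be checked numerically on the four-row example. The following is a minimal Python sketch; the function names `column_entropy` and `alignment_entropy` are illustrative and not part of the lecture, and the logarithm is taken base 2 to match the worst-case value of 2.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (base 2) of the letters in one alignment column."""
    total = len(column)
    # p * log2(1/p) is the same quantity as -p * log2(p)
    return sum((c / total) * math.log2(total / c)
               for c in Counter(column).values())

def alignment_entropy(rows):
    """Sum of column entropies; rows are equal-length aligned strings."""
    return sum(column_entropy(col) for col in zip(*rows))

rows = ["AAA", "AAA", "AAT", "ATC"]
per_column = [column_entropy(col) for col in zip(*rows)]
print(per_column, alignment_entropy(rows))
```

A fully conserved column contributes 0 and a column with four equally frequent letters contributes 2, so lower totals indicate better-conserved alignments.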

Sum-of-pair score
- Every multiple alignment induces pairwise alignments. The multiple alignment

  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG

  induces:

  x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
  y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
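The projection from a multiple alignment to an induced pairwise alignment amounts to keeping two rows and dropping the columns where both have a gap. A small Python sketch (the helper name `induced_pairwise` is hypothetical):

```python
def induced_pairwise(row_a, row_b):
    """Keep two rows of a multiple alignment and drop gap-only columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if (a, b) != ("-", "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
for first, second in ((x, y), (x, z), (y, z)):
    print(induced_pairwise(first, second))
```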

Sum-of-pair score cont.
- The alignment score for the multiple alignment is the sum of the alignment scores of all of its induced pairwise alignments.
- Consider the pairwise alignment of sequences $a_i$ and $a_j$ imposed by a multiple alignment of k sequences.
- Denote the score of this induced (not necessarily optimal) pairwise alignment as $s^*(a_i, a_j)$.
- Sum up the pairwise scores for a multiple alignment:
  $$s(a_1, \ldots, a_k) = \sum_{i < j} s^*(a_i, a_j)$$

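Sum-of-pair score cont.
- It can also be computed column-wise.
- This is useful for a dynamic programming algorithm that breaks the problem into smaller sub-problems.

  a_1  ATG-C-AAT
  ...  A-G-CATAT
  a_k  ATCCCATTT

  Column 1 (A, A, A): three matching pairs, each scoring 1, so Score = 3.
  Column 3 (G, G, C): one matching pair scoring 1 and two mismatching pairs scoring $-\mu$ each, so Score = $1 - 2\mu$.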
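The equivalence between the pair-by-pair and column-by-column views can be sketched in Python. The scoring here is an assumption made for illustration: a match scores 1, a mismatch scores -mu, a residue against a gap scores -sigma, and a gap against a gap scores 0 (which is what makes the two sums agree). The names `pair_score`, `sp_by_columns`, and `sp_by_pairs` are hypothetical.

```python
from itertools import combinations

def pair_score(a, b, match=1, mu=1, sigma=1):
    """Score one aligned pair of symbols; gap-gap contributes nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -sigma
    return match if a == b else -mu

def sp_by_columns(rows, **params):
    """Sum over columns of the scores of all symbol pairs in the column."""
    return sum(pair_score(a, b, **params)
               for col in zip(*rows)
               for a, b in combinations(col, 2))

def sp_by_pairs(rows, **params):
    """Sum over row pairs of the score of the induced pairwise alignment."""
    return sum(pair_score(a, b, **params)
               for r1, r2 in combinations(rows, 2)
               for a, b in zip(r1, r2))

rows = ["ATG-C-AAT", "A-G-CATAT", "ATCCCATTT"]
print(sp_by_columns(rows), sp_by_pairs(rows))
```

Both functions add up the same set of symbol pairs, only in a different order, which is why the column-wise form can be used inside a dynamic programming recurrence.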

How to compute optimal multiple alignment
- Extend the dynamic programming algorithm for pairwise alignment.
- Recall what the score in each entry of the 2D dynamic programming table means for pairwise alignment.
- What is the dimension of the dynamic programming table for multiple sequence alignment, and what should we store there?

How to compute optimal multiple alignment
- Each entry in the 2D DP table stores the best score for aligning prefixes of the two sequences: entry (i, j) stores the alignment score between S1(0, i) and S2(0, j), where S1 and S2 are the two sequences being aligned.
- This extends to the multiple alignment case.
- How many different combinations of prefixes are there for n sequences? $l_1 \times l_2 \times \cdots \times l_n$, where $l_i$ is the length of the i-th sequence.
- So the DP table for multiple alignment is an n-dimensional table.
- It degenerates to a 2D table for pairwise alignment.

How to compute optimal multiple alignment
- Now, how many entries do we need to refer to in order to compute the score of an entry?
- Recall that each stage of the DP algorithm appends one or zero characters of each sequence to the existing alignment.
- There are two choices (one or zero characters) for each of the n sequences, so there are 2^n combinations in total.
- More precisely, we do not allow a column of all gaps, so the all-zero combination is invalid, which reduces the number of entries to refer to to 2^n - 1.
- For pairwise alignments, we need to refer to 2^2 - 1 = 3 entries in the DP table (left, up, and upper-left).
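The 2^n - 1 count can be made concrete by enumerating the 0/1 step vectors directly. This is a short Python sketch; `predecessor_offsets` and `predecessors` are illustrative names, not part of the lecture.

```python
from itertools import product

def predecessor_offsets(n):
    """All 0/1 step vectors for n sequences except the all-zero (all-gap) one."""
    return [step for step in product((0, 1), repeat=n) if any(step)]

def predecessors(cell):
    """Cells of the n-dimensional DP table that can precede `cell`."""
    result = []
    for step in predecessor_offsets(len(cell)):
        prev = tuple(c - s for c, s in zip(cell, step))
        if min(prev) >= 0:  # stay inside the table
            result.append(prev)
    return result

print(len(predecessor_offsets(2)), len(predecessor_offsets(3)))
print(predecessors((1, 1)))
```

Cells on the boundary of the table have fewer predecessors because some steps would leave the table.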

An example for aligning three sequences
- A three-dimensional Manhattan Tourist Problem
- The DP matrix is 3D.
- We aim to find the path from source to sink that corresponds to the best alignment.

Filling the DP matrix
- In 2-D, 3 edges in each unit square
- In 3-D, 7 edges in each unit cube

Architecture of the 3D alignment cell
The eight vertices of the cell ending at (i, j, k):

  (i-1, j-1, k-1)   (i-1, j, k-1)
  (i-1, j-1, k)     (i-1, j, k)
  (i, j-1, k-1)     (i, j, k-1)
  (i, j-1, k)       (i, j, k)

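Recursive function for MSA

$$s_{i,j,k} = \max \begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two indels}
\end{cases}$$

- $\delta(x, y, z)$ is an entry in the 3-D scoring matrix.
- $\delta(x, y, z)$ can be computed as the sum-of-pair score of the column (x, y, z).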
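The seven-case recurrence can be written compactly by looping over the seven non-zero 0/1 step vectors. The following Python sketch computes only the optimal score (no traceback). It assumes a sum-of-pairs column score with match = 1, mismatch = -1, residue-gap = -1, and gap-gap = 0; these values and the names `sp_column` and `align3_score` are illustrative choices, not the lecture's.

```python
from itertools import combinations, product

def sp_column(col, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of one column (x, y, z); gap-gap pairs score 0."""
    total = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total

def align3_score(v, w, u):
    """Optimal sum-of-pairs score for aligning three sequences."""
    n, m, p = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    s[0][0][0] = 0
    steps = [d for d in product((0, 1), repeat=3) if any(d)]  # 7 moves
    # Lexicographic order guarantees every predecessor is filled first.
    for i, j, k in product(range(n + 1), range(m + 1), range(p + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = neg_inf
        for di, dj, dk in steps:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            col = (v[i - 1] if di else "-",
                   w[j - 1] if dj else "-",
                   u[k - 1] if dk else "-")
            best = max(best, s[pi][pj][pk] + sp_column(col))
        s[i][j][k] = best
    return s[n][m][p]

print(align3_score("ACG", "ACG", "ACG"))
```

The table has (n+1)(m+1)(p+1) cells and each cell examines up to seven predecessors, so this is only practical for short sequences.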

What is the time complexity?
- We have l^n entries to fill, and filling each entry takes 2^n time.
- The overall complexity is O(l^n * 2^n).
- Conclusion: the dynamic programming approach for aligning two sequences is easily extended to n sequences, but it is impractical due to its exponential running time.

Progressive multiple sequence alignment
- Perform all-against-all pairwise alignments for the n sequences.
- Choose the most similar pair of strings and combine them into a profile, thereby reducing the alignment of n sequences to an alignment of n-1 sequences/profiles.
- Repeat until one sequence/profile remains.
- This is a greedy heuristic.

  n strings:               n-1 strings/profiles:
  u_1 = ACGTACGTACGT       u_1 = ACg/tTACg/tTACg/cT
  u_2 = TTAATTAATTAA       u_2 = TTAATTAATTAA
  u_3 = ACTACTACTACT       ...
  ...                      u_n = CCGGCCGGCCGG
  u_n = CCGGCCGGCCGG

Completing the iteration
- We need to find the most similar pair of strings at each iteration.
- Therefore we need to redefine the similarity between the newly formed profile and the other strings/profiles.
- This is done using the Neighbor Joining algorithm: https://en.wikipedia.org/wiki/neighbor_joining
- Here, u is the new profile, f and g are the two strings/profiles that form u, and k is any remaining string/profile.
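For reference, the distance update used by the standard neighbor-joining algorithm, written in the same notation (u formed from f and g, k any remaining string/profile), is sketched below. This is the textbook form of the rule and is given here on the assumption that it is the update the slide refers to; d denotes the pairwise distance.

```latex
d(u, k) = \frac{1}{2} \left[ d(f, k) + d(g, k) - d(f, g) \right]
```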

Completing the iteration
Profile representation

  -  A  G  G  C  T  A  T  C  A  C  C  T  G
  T  A  G  -  C  T  A  C  C  A  -  -  -  G
  C  A  G  -  C  T  A  C  C  A  -  -  -  G
  C  A  G  -  C  T  A  T  C  A  C  -  G  G
  C  A  G  -  C  T  A  T  C  G  C  -  G  G

  A  0  1  0  0  0  0  1  0  0 .8  0  0  0  0
  C .6  0  0  0  1  0  0 .4  1  0 .6 .2  0  0
  G  0  0  1 .2  0  0  0  0  0 .2  0  0 .4  1
  T .2  0  0  0  0  1  0 .6  0  0  0  0 .2  0
  - .2  0  0 .8  0  0  0  0  0  0 .4 .8 .4  0

A single sequence can be viewed as a special-case profile.
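A profile of this kind is just a table of per-column letter frequencies. The Python sketch below builds one from the five-row example; the rows are written with gaps placed so that every row has 14 columns and the resulting frequencies agree with the profile values given in the text. The function name `build_profile` and the dictionary-per-column layout are illustrative choices.

```python
from collections import Counter

def build_profile(rows, alphabet="ACGT-"):
    """Return one {symbol: frequency} dict per alignment column."""
    profile = []
    for col in zip(*rows):
        counts = Counter(col)
        profile.append({x: counts[x] / len(col) for x in alphabet})
    return profile

rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
profile = build_profile(rows)
print(profile[0])                    # first column: C, T and gap are present
print(build_profile(["ACG"])[0])     # a single sequence gives 0/1 frequencies
```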

Aligning two profiles
- Two profiles represented using frequencies can be aligned using a slightly modified pairwise sequence alignment algorithm.
- The score for matching two columns can be computed as the sum-of-pair score of the two columns.
- An affine gap penalty can be easily incorporated.
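One way to realize this modification is to replace the symbol-versus-symbol score in global pairwise alignment with a frequency-weighted score between two profile columns. The Python sketch below does this with a linear gap penalty for brevity. The frequency weighting, the scoring values (match 1, mismatch -1, gap -1), and the names `column_score`, `align_profiles`, and `as_profile` are assumptions made for illustration, not the lecture's definitions.

```python
def symbol_score(x, y, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols; gap-gap contributes nothing."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def column_score(p, q):
    """Frequency-weighted sum of pair scores between two profile columns."""
    return sum(px * qy * symbol_score(x, y)
               for x, px in p.items() for y, qy in q.items())

GAP_COLUMN = {"-": 1.0}  # a column consisting only of gaps

def align_profiles(P, Q):
    """Global alignment score of two profiles (lists of frequency dicts)."""
    n, m = len(P), len(Q)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + column_score(P[i - 1], GAP_COLUMN)
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + column_score(GAP_COLUMN, Q[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + column_score(P[i - 1], Q[j - 1]),
                s[i - 1][j] + column_score(P[i - 1], GAP_COLUMN),
                s[i][j - 1] + column_score(GAP_COLUMN, Q[j - 1]),
            )
    return s[n][m]

def as_profile(seq):
    """A single sequence viewed as a profile with 0/1 frequencies."""
    return [{c: 1.0} for c in seq]

print(align_profiles(as_profile("ACG"), as_profile("AG")))
```

When both inputs are single sequences, the procedure reduces to ordinary global pairwise alignment, which gives a simple sanity check.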

Example http://statweb.stanford.edu/~nzhang/345_web/sequence_slides3.pdf

The guide tree
For computation of the distances see https://en.wikipedia.org/wiki/Neighbor_joining
http://statweb.stanford.edu/~nzhang/345_web/sequence_slides3.pdf

Progressive alignment based on the guide tree http://statweb.stanford.edu/~nzhang/345_web/sequence_slides3.pdf

Time complexity for progressive multiple alignment: N^3
- At each stage, we need to re-compute the distance between the newly formed profile and the other N (in the worst case) sequences/profiles (linear).
- At each stage, we also need to perform pairwise alignment (quadratic).
- Taken together, each stage requires quadratic time.
- There are N stages, because each stage reduces the set of sequences/profiles by one.
- So N^3 in total.

ClustalW: more sophisticated scoring function
- Firstly, individual weights are assigned to each sequence in a partial alignment in order to downweight near-duplicate sequences and up-weight the most divergent ones.
- Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned.
- Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.
- Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions.
Thompson et al. 1994 Nucleic Acids Research

ClustalW: residue-dependent gap penalty Thompson et al. 1994 Nucleic Acids Research

ClustalW: output http://statweb.stanford.edu/~nzhang/345_web/sequence_slides3.pdf

ClustalW: multiple sequence alignment
http://www.ebi.ac.uk/tools/msa/clustalw2/
http://www.genome.jp/tools/clustalw/