Basics of Multiple Sequence Alignment

Similar documents
Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

Supplementary Online Material PASTA: ultra-large multiple sequence alignment

Multiple sequence alignment. November 20, 2018

Sequence alignment algorithms

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

Stephen Scott.

Lecture 5: Multiple sequence alignment

Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation

Chapter 6. Multiple sequence alignment (week 10)

Evolutionary tree reconstruction (Chapter 10)

Multiple sequence alignment. November 2, 2017

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

Scaling species tree estimation methods to large datasets using NJMerge

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT

Sequence length requirements. Tandy Warnow Department of Computer Science The University of Texas at Austin

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

PASTA: Ultra-Large Multiple Sequence Alignment

CS273: Algorithms for Structure Handout # 4 and Motion in Biology Stanford University Thursday, 8 April 2004

Multiple Sequence Alignment Gene Finding, Conserved Elements

Multiple Sequence Alignment II

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Brief review from last class

Alignment ABC. Most slides are modified from Serafim s lectures

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Multiple Sequence Alignment. With thanks to Eric Stone and Steffen Heber, North Carolina State University

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

3.4 Multiple sequence alignment

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Biology 644: Bioinformatics

Multiple Sequence Alignment (MSA)

Biology 644: Bioinformatics

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

AlignMe Manual. Version 1.1. Rene Staritzbichler, Marcus Stamm, Kamil Khafizov and Lucy R. Forrest

Multiple Sequence Alignment. Mark Whitsitt - NCSA

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

Sequence Comparison: Dynamic Programming. Genome 373 Genomic Informatics Elhanan Borenstein

Lecture 10. Sequence alignments

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

Sequence analysis Pairwise sequence alignment

Introduction to Triangulated Graphs. Tandy Warnow

Multiple Sequence Alignment

MULTIPLE SEQUENCE ALIGNMENT SOLUTIONS AND APPLICATIONS

On the Efficacy of Haskell for High Performance Computational Biology

EECS730: Introduction to Bioinformatics

Tutorial: Phylogenetic Analysis on BioHealthBase Written by: Catherine A. Macken Version 1: February 2009

Introduction to Trees

MULTIPLE SEQUENCE ALIGNMENT

The Simplex Algorithm

Dynamic Programming in 3-D Progressive Alignment Profile Progressive Alignment (ClustalW) Scoring Multiple Alignments Entropy Sum of Pairs Alignment

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

Event-based Phylogeny Inference and Multiple Sequence Alignment

Computational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh

Computational Molecular Biology

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Biological Sequence Matching Using Fuzzy Logic

Sequence Alignment 1

Sequence comparison: Local alignment

A multiple alignment tool in 3D

Dynamic Programming for Phylogenetic Estimation

Representation is Everything: Towards Efficient and Adaptable Similarity Measures for Biological Data

CS 581. Tandy Warnow

Multiple alignment. Multiple alignment. Computational complexity. Multiple sequence alignment. Consensus. 2 1 sec sec sec

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

Machine Learning. Computational biology: Sequence alignment and profile HMMs

Pairwise Sequence Alignment. Zhongming Zhao, PhD

15-780: Graduate Artificial Intelligence. Computational biology: Sequence alignment and profile HMMs

A Layer-Based Approach to Multiple Sequences Alignment

Lecture 3: February Local Alignment: The Smith-Waterman Algorithm

One report (in pdf format) addressing each of following questions.

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Special course in Computer Science: Advanced Text Algorithms

Chordal Graphs and Evolutionary Trees. Tandy Warnow

Recent Research Results. Evolutionary Trees Distance Methods

BLAST & Genome assembly

DNA Alignment With Affine Gap Penalties

Research Article Aligning Sequences by Minimum Description Length

Computational Molecular Biology

Inverse Sequence Alignment from Partial Examples

CAP BLAST. BIOINFORMATICS Su-Shing Chen CISE. 8/20/2005 Su-Shing Chen, CISE 1

Finding data. HMMER Answer key

Machine Learning Techniques for Bacteria Classification

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet)

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Sequence alignment theory and applications Session 3: BLAST algorithm

Visual Representations for Machine Learning

Mismatch String Kernels for SVM Protein Classification

Leaning Graphical Model Structures using L1-Regularization Paths (addendum)

Introduction to Computational Phylogenetics

IMPROVING THE QUALITY OF MULTIPLE SEQUENCE ALIGNMENT. A Dissertation YUE LU

A New Approach For Tree Alignment Based on Local Re-Optimization

Blast2GO User Manual. Blast2GO Ortholog Group Annotation May, BioBam Bioinformatics S.L. Valencia, Spain

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

PFstats User Guide. Aspartate/ornithine carbamoyltransferase Case Study. Neli Fonseca

Cost Partitioning Techniques for Multiple Sequence Alignment. Mirko Riesterer,

Transcription:

Basics of Multiple Sequence Alignment Tandy Warnow February 10, 2018

Basics of Multiple Sequence Alignment Tandy Warnow

Basic issues What is a multiple sequence alignment? Evolutionary processes operating on sequences Using profile HMMs to model multiple sequence alignment Optimization problems Basic techniques of standard methods Fundamental limitations of nearly all multiple sequence alignment methods How to evaluate alignments Performance studies of multiple sequence alignment methods operate on data

-HMM alignment strategies: the underlying alignments that A multiple sequence alignment rofile HMMs are not modified during the process. final multiple sequence alignment A of {s 1,s 2,...,s 8 } obta d A 2 from Table 9.7, computing their profiles P 1 and P 2 (see s 1 - - - T A C s 2 - - A T A C s 3 C - A - - G s 4 C - A A T G s 5 C - - T - G s 6 C T - - A C s 7 C - A T A C s 8 G - A - A T

Homology Two letters in two sequences are homologous if they descend from a letter in a common ancestor. Deletion Substitution ACGGTGCAGTTACCA Insertion ACCAGTCACCTA ACGGTGCAGTTACC-A AC----CAGTCACCTA The true mul*ple alignment Reflects historical substitution, insertion, and deletion events Defined using transitive closure of pairwise alignments computed on edges of the true tree Evolutionary multiple sequence alignment seeks to create a matrix in which the input sequences are the rows and each column has letters that are all homologous to each other.

Evolutionary processes operating on sequences Substitutions Insertions and deletions of strings (indels) Rearrangements (inversions and transpositions) Duplications of regions Note: most alignment methods stretch out sequences so that the line up well, and so only address substitutions and indels.

ith A 1 and A 2. Hence, during this profile-profile alignment the Building alignments a profile A 1 HMM and Afor 2. an This MSA is a property of profile-pr -HMM alignment strategies: the underlying alignments that For the MSA below, the standard technique for building profile rofile HMMs are not modified during the process. HMMs would use an insertion state for position 2 (because more than 50% of the sequences are gapped in that position). s 1 - - - T A C s 2 - - A T A C s 3 C - A - - G s 4 C - A A T G s 5 C - - T - G s 6 C T - - A C s 7 C - A T A C s 8 G - A - A T

Optimization criteria Sum-of-pairs (sum of edit distances on induced pairwise alignments) Tree alignment (sum of costs of edges) Maximum likelihood under a statistical model of sequence evolution All three are NP-hard, even if the tree is given.

Basic techniques Multiple sequence alignment methods generally use one or more of the following techniques to align a set S of sequences: Align all sequences in S to a single sequence s or to a profile HMM (or some other model) Progressive alignment: compute a guide tree, and then align sequences from the bottom up Consistency: infer support for homology between two letters using third sequences Divide-and-conquer (especially based on a tree) Iteration between tree estimation and alignment estimation

Adding a sequence s to an alignment A We are given an alignment A and its profile HMM, H, and we are also given s = s 1 s 2... s n, which is homologous to the sequences in A. To add s to A, we: 1. Find the maximum likelihood path through H for s 2. Use that path to add s into A. Details for Step 2: Align s to the profile HMM, and note which states emit the letters of s. If letter s i is emitted by match state j, put s i in the column for this match state (might not be j). If letter s i is emitted by an insertion state, put s i in its own column after s i 1. (Don t put two letters in the same column if either is emitted by an insertion state.)

Aligning a set S of sequences Suppose S is a set of unaligned sequences and we are told they are all homologous (i.e., share a common evolutionary history) with the sequences in a family F. How shall we compute a multiple sequence alignment for S?

Aligning a set S of homologous sequences Compute an MSA A for the sequences in F. Build the profile HMM H for the alignment A. Add all the sequences in S to A, independently. The alignment produced will contain all the sequences of F S; you can then restrict to just the sequences in S.

Progressive Alignment Build a guide tree from the sequences Align the sequences from the bottom-up (aligning alignments as you go up) 21 2.3 Multiple sequence alignment Figure 2.7 Outline of the progressive alignment approach. For a given set of input sequences (a), a distance matrix is computed (b), and from this a phylogenetic guide tree is derived (c). Then, sequences at the leaves of the tree are aligned to produce profiles at the internal

Aligning alignments In a progressive alignment, alignments on disjoint sets are aligned together, to make an alignment on the combined set of sequences. To do this, the two alignments are first represented by profiles, and then these profiles are aligned to each other. This is performed using dynamic programming, similar to Needleman-Wunsch. Examples of methods that can align two alignments include Opal and Muscle. However, another approach is to represent each of the alignments as profile HMMs, and then align the two profile HMMs.

Using libraries of pairwise alignments, part 1 Suppose we have a set S of sequences, and a library L of pairwise alignments for the sequences in S. For each pair x, y of letters (one from each of two sequences), you have the frequency with which the two letters are aligned in L (i.e., the support for the homology pair x, y). Given a library of pairwise alignments, we can define the support for all homology pairs, and then seek the best MSA for these support values.

Using libraries of pairwise alignments, part 2 The consistency technique is another way of using a library L of pairwise alignments. Another way of defining the support for the homology pair is the number of letters z (in a third sequence) such that x and z and y and z are aligned in some pairwise alignments in P. This is how consistency is defined support via a third sequence. Many of the best MSA methods (e.g., T-Coffee and ProbCons) use the consistency technique in some way, and differ mainly in how they construct the library.

Statistical alignment estimation Some alignment methods are based on explicit parametric models of sequence evolution that include insertions and deletions (indels) as well as substitutions. Examples: BAli-Phy StatAlign Prank PAGAN These likelihood-based methods tend to be more computationally intensive than standard MSA methods, but have appealing statistical properties.

Divide-and-conquer using trees, cont. The main objective of divide-and-conquer is to scale good MSA methods to larger datasets, so that they are more accurate or can analyze larger datasets. Build a guide tree (perhaps by computing pairwise edit distances and then a tree based on the distances) Divide sequence dataset into disjoint subsets using the guide tree Align subsets 9.13 Co-estimation of alignments and trees 217 Align alignments together (e.g., profile-profile alignment)

One PASTA iteration

Iteration between MSA and tree estimation

How to evaluate methods Given an estimated and true (or reference alignment), we can compute various statistics, many of which are based on homology pairs : SPFN: sum of the false negative homology pairs SPFP: sum of the false positive homology pairs TC: total column score Compression: ratio of the estimated alignment length to true alignment length Distance between gap length distributions

Issues to consider Most methods can only handle indels and substitutions (i.e., no rearrangements or duplications). Most methods assume full-length sequences. Statistical methods are all based on models of sequence evolution, and the models are limited. Mosts methods cannot analyze very large datasets. Evaluation of methods is tricky.