Multiple sequence alignment. November 20, 2018

Similar documents
Multiple sequence alignment. November 2, 2017

Multiple Sequence Alignment. Mark Whitsitt - NCSA

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences

Lecture 5: Multiple sequence alignment

Stephen Scott.

Multiple Sequence Alignment: Multidimensional. Biological Motivation

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

Multiple Sequence Alignment (MSA)

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo

Basics of Multiple Sequence Alignment

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles

Sequence alignment algorithms

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

Chapter 6. Multiple sequence alignment (week 10)

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

PoMSA: An Efficient and Precise Position-based Multiple Sequence Alignment Technique

IMPROVING THE QUALITY OF MULTIPLE SEQUENCE ALIGNMENT. A Dissertation YUE LU

Multiple Sequence Alignment. With thanks to Eric Stone and Steffen Heber, North Carolina State University

EECS730: Introduction to Bioinformatics

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

3.4 Multiple sequence alignment

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

INVESTIGATION STUDY: AN INTENSIVE ANALYSIS FOR MSA LEADING METHODS

BLAST, Profile, and PSI-BLAST

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

Computational Genomics and Molecular Biology, Fall

Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni Gabaldón

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Evolutionary tree reconstruction (Chapter 10)

MULTIPLE SEQUENCE ALIGNMENT SOLUTIONS AND APPLICATIONS

Sequence alignment theory and applications Session 3: BLAST algorithm

Introduction to Bioinformatics Online Course: IBT

Approaches to Efficient Multiple Sequence Alignment and Protein Search

Computational Molecular Biology

Brief review from last class

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

PARAMETER ADVISING FOR MULTIPLE SEQUENCE ALIGNMENT

Data Mining Technologies for Bioinformatics Sequences

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Two Phase Evolutionary Method for Multiple Sequence Alignments

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Sequence Comparison: Dynamic Programming. Genome 373 Genomic Informatics Elhanan Borenstein

A New Approach For Tree Alignment Based on Local Re-Optimization

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

MULTIPLE SEQUENCE ALIGNMENT

Multiple alignment. Multiple alignment. Computational complexity. Multiple sequence alignment. Consensus. 2 1 sec sec sec

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

PaMSA: A Parallel Algorithm for the Global Alignment of Multiple Protein Sequences

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT

Higher accuracy protein Multiple Sequence Alignment by Stochastic Algorithm

Bioinformatics explained: Smith-Waterman

Cost Partitioning Techniques for Multiple Sequence Alignment. Mirko Riesterer,

spaces (gaps) in each sequence so that they all have the same length. The following is a simple example of an alignment of the sequences ATTCGAC, TTCC

Basic Local Alignment Search Tool (BLAST)

ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS

Multiple Sequence Alignment Gene Finding, Conserved Elements

Local weighting schemes for protein multiple sequence alignment

Special course in Computer Science: Advanced Text Algorithms

Algorithmic Approaches for Biological Data, Lecture #20

Dynamic Programming in 3-D Progressive Alignment Profile Progressive Alignment (ClustalW) Scoring Multiple Alignments Entropy Sum of Pairs Alignment

Alignment ABC. Most slides are modified from Serafim s lectures

human chimp mouse rat

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

From [4] D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York, 1997.

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Bioinformatics for Biologists

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching,

Geometric data structures:

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Lecture 20: Clustering and Evolution

Recent Research Results. Evolutionary Trees Distance Methods

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

Lecture 20: Clustering and Evolution

Biologically significant sequence alignments using Boltzmann probabilities

Multiple Sequence Alignment II

From Smith-Waterman to BLAST

A multiple alignment tool in 3D

Bioinformatics explained: BLAST. March 8, 2007

LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA

Multiple sequence alignment based on combining genetic algorithm with chaotic sequences

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet)

Mapping Sequence Conservation onto Structures with Chimera

AlignMe Manual. Version 1.1. Rene Staritzbichler, Marcus Stamm, Kamil Khafizov and Lucy R. Forrest

Data Walkthrough: Background

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

This exposition is based on the following sources, which are recommended reading: significant functional/structural similarity

CS313 Exercise 4 Cover Page Fall 2017

Chapter 2 Basic Structure of High-Dimensional Spaces

Transcription:

Multiple sequence alignment November 20, 2018

Why do multiple alignment? Gain insight into evolutionary history Can assess time of divergence by looking at the number of mutations needed to change one sequence into another Create phylogenetic trees to determine which sequences are most closely related Highly conserved regions imply structural/functional information Group sequences into families -> functional information Sometimes the only way to align distant sequences

alignment reflects structure though sequences are not well conserved

True multiple sequence alignment Dynamic programming algorithms are too slow and in fact, cannot guarantee an optimal answer BUT it s interesting to see how they work The DP recursion is too big to write out but if you have the optimal sequence up to a point, the next step is to make the optimal move (gap must be considered for every single sequence, in all combinations)

dynamic multiple sequence alignment current alignment character to add

Time considerations For true MSA need to find sum-of-pair scores for all possible sequence pairs. Time complexity: O(2 N L N ) where N is the number of sequences and L is the sequence length (this assumes that the sequences are roughly the same length) BUT, in fact, MSA is NP-complete.

Carillo-Lipman algorithm (implemented by Lipman, Altschul and Kececioglu) Heuristic method, works well for a few sequences Idea: the number of cells in the n-dimensional alignment lattice is the product of all of the sequence lengths! this is way too big, so we can t examine every cell Need to find a way to reduce the number of cells examined and still get the right answer...

Carillo-Lipman algorithm The MSA path projects a pairwise path onto all sequence pairs. This means that we can calculate an upper bound for the cost of this projection; this upper bound limits the points through which the projection can pass

Carillo-Lipman algorithm A A B C B Lipman et al. wrote MSA, a multiple sequence alignment program using dynamic programming with heuristic modifications C Can align 6-8 sequences of 200-300 residues in length; practical limit is closer to 4-5 sequences

Progressive alignment methods Feng and Doolittle s idea: create pairwise alignments, use them to make an ad hoc tree, then use that tree to align the alignments into one big MSA Underlying assumption: closely related sequences will give the most robust alignments, so we should start with those (and the tree highlights which sequences those are)

Generic progressive alignment

CLUSTAL W original version released in 1994 Nice combination of algorithms and heuristics to create a very usable program. Still widely used today and is a benchmark for new programs (though it shouldn t be)

CLUSTAL W 1. Align all pairs of sequences separately to calculate distance matrix from divergence of each pair of sequences 2. Calculate guide tree from distance matrix 3. Align sequences progressively according to branching order in guide tree

CLUSTAL W

Progressive alignment once a gap, always a gap example:

CLUSTAL Omega Improvements: align sequences to randomly chosen subset first guide tree made by k-means clustering align along the guide tree until two alignments remain; use HMM to consolidate

CLUSTAL Omega Improvements, continued: Dynamically vary gap penalties by position and in a residue-specific manner 5 hydrophilic residues -> lower gap penalty by 1/3 Increase gap penalty less than 8 residues away from an existing gap Lower gap penalty in an existing gap Substitution matrices varied at difference alignment stages according to the divergence of the sequences to be aligned

Iterative MSA CLUSTAL W is a progressive algorithm Many methods use format of CLUSTAL W but then rearrange the alignment iteratively to improve its score

MAFFT Iterative multiple sequence alignment based on fast Fourier transform first published in 2002 cleverly restricts scope of dynamic programming used can incorporate other data (structure etc) while making the alignments

MAFFT first steps are progressive -- use kmer counting as first measure of similarity (k=6)

MAFFT then iterates to improve the trees & alignments

MAFFT allows addition of unaligned sequences into existing multiple sequence alignment has RNA-specific modes can incorporate structural information into scoring

bigger issues...

bigger issues...

PRANK: phylogeny-aware alignment length of alignment should grow with phylogenetic distance gaps are free after one gap is placed distinguishes insertions from deletions needs relatively large number of sequences for accurate phylogeny relatively slow

PRANK: phylogeny-aware alignment length of alignment should grow with phylogenetic distance gaps are free after one gap is placed distinguishes insertions from deletions needs relatively large number of sequences for accurate phylogeny relatively slow

still a VERY active field ~1-2 new algorithms published per month with more genomes, MSA is increasingly necessary and increasingly useful how to compare MSA algorithms?

curated databases of alignments BAliBASE SABmark PREFAB HOMSTRAD but these aren t as relevant now... new benchmarking methods published almost as often!

benchmarks simulated sequences consistency-based (compare to a bunch of MSA results) structure-based phylogeny-based

benchmarking

Billions of human-brain peta-flops of computation are wasted daily playing games that do not contribute to advancing knowledge.