Multiple alignment. Multiple alignment. Computational complexity. Multiple sequence alignment. Consensus. 2 1 sec sec sec

Similar documents
PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace.

Multiple Sequence Alignment (MSA)

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

Lecture 5: Multiple sequence alignment

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)

EECS730: Introduction to Bioinformatics

Multiple Sequence Alignment: Multidimensional. Biological Motivation

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

BLAST - Basic Local Alignment Search Tool

Multiple Sequence Alignment. Mark Whitsitt - NCSA

Multiple Sequence Alignment II

Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences

Multiple sequence alignment. November 20, 2018

Basic Local Alignment Search Tool (BLAST)

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo

MULTIPLE SEQUENCE ALIGNMENT SOLUTIONS AND APPLICATIONS

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

PoMSA: An Efficient and Precise Position-based Multiple Sequence Alignment Technique

3.4 Multiple sequence alignment

Lab 4: Multiple Sequence Alignment (MSA)

JET 2 User Manual 1 INSTALLATION 2 EXECUTION AND FUNCTIONALITIES. 1.1 Download. 1.2 System requirements. 1.3 How to install JET 2

Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni Gabaldón

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Biology 644: Bioinformatics

Chapter 6. Multiple sequence alignment (week 10)

MULTIPLE SEQUENCE ALIGNMENT

Comparison and Evaluation of Multiple Sequence Alignment Tools In Bininformatics

Sequence analysis Pairwise sequence alignment

Dynamic Programming in 3-D Progressive Alignment Profile Progressive Alignment (ClustalW) Scoring Multiple Alignments Entropy Sum of Pairs Alignment

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..

Approaches to Efficient Multiple Sequence Alignment and Protein Search

ProDA User Manual. Version 1.0. Algorithm developed by Tu Minh Phuong, Chuong B. Do, Robert C. Edgar, and Serafim Batzoglou.

Higher accuracy protein Multiple Sequence Alignment by Stochastic Algorithm

Multiple sequence alignment. November 2, 2017

Sequence alignment algorithms

Basics of Multiple Sequence Alignment

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

INVESTIGATION STUDY: AN INTENSIVE ANALYSIS FOR MSA LEADING METHODS

AlignMe Manual. Version 1.1. Rene Staritzbichler, Marcus Stamm, Kamil Khafizov and Lucy R. Forrest

CS273: Algorithms for Structure Handout # 4 and Motion in Biology Stanford University Thursday, 8 April 2004

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles

Stephen Scott.

Multiple Sequence Alignment. With thanks to Eric Stone and Steffen Heber, North Carolina State University

Special course in Computer Science: Advanced Text Algorithms

Heuristic methods for pairwise alignment:

Computational Genomics and Molecular Biology, Fall

A New Approach For Tree Alignment Based on Local Re-Optimization

Sequence alignment theory and applications Session 3: BLAST algorithm

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

Special course in Computer Science: Advanced Text Algorithms

Two Phase Evolutionary Method for Multiple Sequence Alignments

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University

A multiple alignment tool in 3D

CHAPTER 5 COMPARATIVE ANALYSIS WITH EXISTING ALGORITHMS

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University

BLAST, Profile, and PSI-BLAST

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

Research Article Aligning Sequences by Minimum Description Length

Clustering of Proteins

A Layer-Based Approach to Multiple Sequences Alignment

Multiple Sequence Alignment Gene Finding, Conserved Elements

Alignment ABC. Most slides are modified from Serafim s lectures

Bioinformatics explained: Smith-Waterman

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Geneious 5.6 Quickstart Manual. Biomatters Ltd

Chromatin immunoprecipitation sequencing (ChIP-Seq) on the SOLiD system Nature Methods 6, (2009)

Multiple Sequence Alignment Using Reconfigurable Computing

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

PaMSA: A Parallel Algorithm for the Global Alignment of Multiple Protein Sequences

FastA & the chaining problem

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:

Bioinformatics explained: BLAST. March 8, 2007

CAP BLAST. BIOINFORMATICS Su-Shing Chen CISE. 8/20/2005 Su-Shing Chen, CISE 1

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures

Additional Alignments Plugin USER MANUAL

Machine Learning Techniques for Bacteria Classification

BLAST MCDB 187. Friday, February 8, 13

PARAMETER ADVISING FOR MULTIPLE SEQUENCE ALIGNMENT

T-Coffee: Cheat Sheet Tutorial and FAQ. Cédric Notredame

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Single Pass, BLAST-like, Approximate String Matching on FPGAs*

Data Mining Technologies for Bioinformatics Sequences

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD

EECS730: Introduction to Bioinformatics

CS313 Exercise 4 Cover Page Fall 2017

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

Mismatch String Kernels for SVM Protein Classification

Small Libraries of Protein Fragments through Clustering

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

Transcription:

Introduction to Bioinformatics Iosif Vaisman Email: ivaisman@gmu.edu VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG-- Column cost: the sum of costs for all possible pairs Computational complexity Alignment of protein sequences with 200 amino acid residues: # of sequences CPU time 2 1 sec 3 200 sec 10 200 8 sec A correct multiple alignment corresponds to an evolutionary history: no correct way to determine practical way - to find an alignment with the maximum score Multiple sequence alignment Given k (k > 2) sequences, s 1,, s k, each sequence consisting of characters from an alphabet A multiple alignment is a a rectangular array, consisting of characters from the alphabet A (A + "-"), that satisfies the following 3 conditions: 1. There are exactly k rows. 2. Ignoring the gap character, row number i is exactly the sequence s i. 3. Each column contains at least one character different from "-". Consensus Consensus sequence - idealized sequence in which each position represents the amino acid most often found when many sequences are compared. Plurality - minimum number of votes for a consensus Threshold - scoring matrix value below which a symbol may not vote for a coalition. Sensitivity - minimum score to select consensus Profiles - blocks of prealigned sequences

algorithm algorithm 1. Pairwise alignments (progressive pairwise alignments) 2. Distance matrix calculation 3. Guide tree creation (hierarchical clustering) 4. New sequence addition Do and Katoh, 2008 algorithm algorithm Guide tree Progressive alignments Divide-and-conquer method Scoring system (distances) D(ij) = -ln S real (ij) - S rand (ij) S iden (ij) - S rand (ij) x 100 S real (ij) - observed similarity score for two aligned sequences i and j S iden (ij) - average of the two scores for each sequence aligned with itself S rand (ij) - average score determined from 100 global randomizations of the two sequences The distances D(ij) are used to generate the distance matrix from which the approximate guide tree is generated. Scoring systems Phylogenetic tree Star tree Sum of pairs

s comparison (1,0) (1,1) C (1,1,1) B B (0,0) A (0,1) (0,0,0) A Segment - line joining two vertices Each unit m-dimensional cube in the lattice contains 2 m -1 segments Alignment Path for 3 Sequences (0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1), (4,3,2) V S N - S - S N A - - - - A S Alignment statistics Rablpb Humcetp Rabcetp Bovbpi Humlbpa Ratlbp Maccetp Humbpi 478 67% 65% 19% 19% 18% 42% 43% 1 0 82% 80% 39% 39% 36% 64% 65% 0 1% 0% 5% 5% 12% 2% 2% 327 483 58% 16% 16% 16% 39% 41% 2 400 0 75% 38% 38% 35% 62% 63% 5 0 0% 5% 5% 12% 1% 1% 318 284 482 18% 18% 17% 40% 43% 3 390 367 0 38% 38% 35% 64% 64% 4 1 0 5% 5% 12% 1% 1% Pairwise Projections of the Alignment 1: The number of residues that match exactly (identical residues) between the two sequences. 2: The number of residues whose juxtaposition yields a greater than zero score in the current scoring table (similar residues or conservative substitutions). 3: The number of residues lined up with a gap character.

Alignment score Rablpb Humcetp Rabcetp Bovbpi Humlbpa Ratlbp Maccetp Humbpi 1 4077 2 5358 4129 3 5323 5650 4096 4 8103 8229 8112 4210 5 8109 8243 8118 4332 4219 6 8535 8672 8575 5511 5519 4261 7 6474 6531 6500 8103 8119 8572 4103 8 6392 6434 6378 8033 8035 8520 5508 4083 Alignment visualization M---MGALARALPS-ILLALLLTSTPEALGA-NPGLVARITDKGLQYAAQEGLLALQ M---MGTWARALLGSTLLSLLLAAAPGALGT-NPGLITRITDKGLEYAAREGLLALQ M---MKSATGPLLP-TLLGLLLLSIPRTQGV-NPAMVVRITDKGLEYAAKEGLLSLQ M---MLAATVLT---LALLGNAHACSKGTSH-EAGIVCRITKPALLVLNHETAKVIQ M---MLAATVLT---LALLGNVHACSKGTSH-KAGIVCRITKPALLVLNQETAKVIQ -----------------------ACPKGASY-EAGIVCRITKPALLVLNQETAKVVQ MRENMARGPCNAPRWVSLMVLVAIGTAVTAAVNPGVVVRISQKGLDYASQQGTAALQ M---MARGPDTARRWATLVVLAALGTAVTTT-NPGIVARITQKGLDYACQQGVLTLQ Identity Summary view : 130 : 130 : 114 : 135 Alignment visualization M---MGALARALPS-ILLALLLTSTPEALGA-NPGLVARITDKGLQYAAQEGLLALQ M---MGTWARALLGSTLLSLLLAAAPGALGT-NPGLITRITDKGLEYAAREGLLALQ M---MKSATGPLLP-TLLGLLLLSIPRTQGV-NPAMVVRITDKGLEYAAKEGLLSLQ M---MLAATVLT---LALLGNAHACSKGTSH-EAGIVCRITKPALLVLNHETAKVIQ M---MLAATVLT---LALLGNVHACSKGTSH-KAGIVCRITKPALLVLNQETAKVIQ -----------------------ACPKGASY-EAGIVCRITKPALLVLNQETAKVVQ MRENMARGPCNAPRWVSLMVLVAIGTAVTAAVNPGVVVRISQKGLDYASQQGTAALQ M---MARGPDTARRWATLVVLAALGTAVTTT-NPGIVARITQKGLDYACQQGVLTLQ Physico-chemical properties Alignment visualization (tree).---.g.la...ps-...a...tst.ealg.-...a...d....---.gtwa...gst..s...a.galgt-...t...d...r....---.ks..gp..p-t..g...lsi..tqgv-..a..v...d...k...s...---.l...vlt---.a..gnah..s...h-ea...pa.lvlnh.takv...---.l...vlt---.a..gn.h..s...h-ka...pa.lvln..takv.. -----------------------...A.Y-EA...PA.LVLN..TAKV.....RGPCNAP...S...IGTAV.A...V...Q...D..S...TA....---..RGPDTAR..AT...A.LGTAV..T-...A...Q...D..C...T.. Differences mode : a quantitative graphical display for binding sites and proteins Reference: Schneider, T.D. Meth. Enzym 274:445, 1996

GRK6_RAT P97711 ( 225) RKGEAMALNEKQILEKVNS 11 GRK6_MOUSE O70293 ( 225) RKGEAMALNEKQILEKVNS 11 O70294 ( 225) RKGEAMALNEKQILEKVNS 11 O70295 ( 225) RKGEAMALNEKQILEKVNS 11 O70296 O70296_MOUSE ( 225) RKGEAMALNEKQILEKVNS 11 P97548 P97548_RAT ( 225) RKGEAMALNEKQILEKVNS 11 P97549 P97549_RAT ( 225) RKGEAMALNEKQILEKVNS 11 GRK5_HUMAN P34947 ( 225) RKGESMALNEKQILEKVNS 12 GRK6_HUMAN P43250 ( 225) RKGEAMALNEKQILEKVNS 11 O60541 ( 225) RKGEAMALNEKQILEKVNS 11 GRK5_RAT Q62833 ( 225) RKGESMALNEKQILEKVNS 12 O70292 ( 225) RKGESMALNEKQILEKVNS 12 O70297 ( 225) RKGESMALNEKQILEKVNS 12 Q13293 ( 226) RKGEAMALNEKRILEKVQS 14 Q13294 ( 194) RKGEAMALNEKRILEKVQS 14 Q15314 ( 226) RKGEAMALNEKRILEKVQS 14 Q15316 ( 194) RKGEAMALNEKRILEKVQS 14 GRK5_BOVIN P43249 ( 225) RKGESMALNEKQILEKVNS 12 Q13295 ( 226) RKGEAMALNEKRILEKVQS 14 Q15313 ( 226) RKGEAMALNEKRILEKVQS 14 Q15315 ( 194) RKGEAMALNEKRILEKVQS 14 GRK4_HUMAN P32298 ( 194) RKGEAMALNEKRILEKVQS 14 Q14453 ( 226) RKGEAMALNEKRILEKVQS 14 GRK4_MOUSE O70291 ( 225) RKGEAMALNEKRILEKLHS 21 Multiple Alignment Programs C. Notredame, 2007 Multiple Alignment Programs PROGRAM ADVANTAGES CAUTIONS CLUSTALW Uses less memory than other programs Less accurate or scalable than modern programs DIALIGN MAFFT, MUSCLE Attempts to distinguish between and non- regions Faster and more accurate than CLUSTALW; good trade-off of accuracy and computational cost. PROBCONS Highest accuracy score on several benchmarks ProDA T-COFFEE Less accurate than CLUSTALW on global benchmarks For very large data sets (say, more than 1000 sequences) select time- and memory-saving options Computation time and memory usage is a limiting factor for large alignment problems (>100 sequences) Does not assume global alignability; High computational cost and less allows repeated, shuffled and absent accurate than CLUSTALW on global domains. benchmarks High accuracy and the ability to incorporate heterogeneous types of information Computation time and memory usage is a limiting factor for large alignment problems (>100 sequences) Input data 2 100 sequences of typical protein length (maximum around 10,000 residues) that are approximately globally 100 500 sequences that are approximately globally >500 sequences that are approximately globally Large numbers of alignments, high-throughput pipeline. Typical alignment tasks Recommendations Use PROBCONS, T-COFFEE, and MAFFT or MUSCLE, compare the results using ALTAVIST. Regions of agreement are more likely to be correct. For sequences with low percent identity, PROBCONS is generally the most accurate, but incorporating structure information (where available) via 3DCoffee (a variant of T-COFFEE) can be extremely helpful. Use MUSCLE or one of the MAFFT scripts with default options. Comparison using ALTAVIST is possible, but the results are hard to interpret with larger numbers of sequences unless they are highly similar. Use MUSCLE with a faster option (we recommend maxiters-2) or one of the faster MAFFT scripts Use MUSCLE with faster options (e.g. maxiters-1 or maxiters-2) or one of the faster MAFFT scripts Input data 2 100 sequences with conserved core regions surrounded by variable regions that are not 2 100 sequences with one or more common domains that may be shuffled, repeated or absent. A small number of unusually long sequences (say, >20,000 residues) Typical alignment tasks Recommendations Use DIALIGN Use ProDA Use CLUSTALW. Other programs may run out of memory, causing an abort (e.g. a segmentation fault).