A multiple alignment tool in 3D

Similar documents
Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

EECS730: Introduction to Bioinformatics

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

A New Approach For Tree Alignment Based on Local Re-Optimization

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Multiple Sequence Alignment. Mark Whitsitt - NCSA

Chapter 6. Multiple sequence alignment (week 10)

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

Stephen Scott.

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Biology 644: Bioinformatics

Computational Genomics and Molecular Biology, Fall

Multiple Sequence Alignment II

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Multiple Sequence Alignment Gene Finding, Conserved Elements

- G T G T A C A C

Sequence Alignment & Search

Basic Local Alignment Search Tool (BLAST)

Programming assignment for the course Sequence Analysis (2006)

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo

Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni Gabaldón

Brief review from last class

Multiple Sequence Alignment Using Reconfigurable Computing

AlignMe Manual. Version 1.1. Rene Staritzbichler, Marcus Stamm, Kamil Khafizov and Lucy R. Forrest

Sequence analysis Pairwise sequence alignment

Bioinformatics explained: BLAST. March 8, 2007

Alignment ABC. Most slides are modified from Serafim s lectures

Algorithmic Approaches for Biological Data, Lecture #20

3.4 Multiple sequence alignment

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles

Lecture 5: Multiple sequence alignment

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

Sequence alignment theory and applications Session 3: BLAST algorithm

Lab 4: Multiple Sequence Alignment (MSA)

Sequence alignment algorithms

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Multiple Sequence Alignment. With thanks to Eric Stone and Steffen Heber, North Carolina State University

In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace.

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD

BMI/CS 576 Fall 2015 Midterm Exam

Page 1.1 Guidelines 2 Requirements JCoDA package Input file formats License. 1.2 Java Installation 3-4 Not required in all cases

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Multiple sequence alignment. November 20, 2018

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

Lecture 10. Sequence alignments

Scaling species tree estimation methods to large datasets using NJMerge

BLAST - Basic Local Alignment Search Tool

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Computational Molecular Biology

Basics of Multiple Sequence Alignment

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment

Outline. Sequence Alignment. Types of Sequence Alignment. Genomics & Computational Biology. Section 2. How Computers Store Information

Bioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure

Weighted Finite-State Transducers in Computational Biology

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties

Bioinformatics for Biologists

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Parsimony-Based Approaches to Inferring Phylogenetic Trees

EECS730: Introduction to Bioinformatics

Multiple Sequence Alignments

Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University

CS273: Algorithms for Structure Handout # 4 and Motion in Biology Stanford University Thursday, 8 April 2004

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens. Katherine St. John City University of New York 1

Additional Alignments Plugin USER MANUAL

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

Algorithms for Bioinformatics

JET 2 User Manual 1 INSTALLATION 2 EXECUTION AND FUNCTIONALITIES. 1.1 Download. 1.2 System requirements. 1.3 How to install JET 2

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

Lesson 13 Molecular Evolution

Special course in Computer Science: Advanced Text Algorithms

Multiple sequence alignment. November 2, 2017

Pairwise Sequence Alignment. Zhongming Zhao, PhD

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

An Efficient Algorithm to Locate All Locally Optimal Alignments Between Two Sequences Allowing for Gaps

MULTIPLE SEQUENCE ALIGNMENT SOLUTIONS AND APPLICATIONS

Heuristic methods for pairwise alignment:

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..

Biologically significant sequence alignments using Boltzmann probabilities

A Study On Pair-Wise Local Alignment Of Protein Sequence For Identifying The Structural Similarity

Network Traffic Classification by Common Subsequence Finding

A Layer-Based Approach to Multiple Sequences Alignment

Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences

Single Pass, BLAST-like, Approximate String Matching on FPGAs*

CS313 Exercise 4 Cover Page Fall 2017

Distance-based Phylogenetic Methods Near a Polytomy

QNet User s Manual. Kristoffer Forslund. November 6, Quartet-based phylogenetic network reconstruction

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)

Alignment of Pairs of Sequences

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion.

Rapid Neighbour-Joining

Gribskov Profile. Hidden Markov Models. Building a Hidden Markov Model #$ %&

Transcription:

Outline Department of Computer Science, Bioinformatics Group University of Leipzig TBI Winterseminar Bled, Slovenia February 2005

Outline Outline 1 Multiple Alignments Problems Goal

Outline Outline 1 Multiple Alignments Problems Goal 2

Outline Outline 1 Multiple Alignments Problems Goal 2 3 Parameters Exon 1 sequences of HOX Globin Domain

Outline Outline 1 Multiple Alignments Problems Goal 2 3 Parameters Exon 1 sequences of HOX Globin Domain 4

Outline 1 Multiple Alignments Problems Goal 2 3 Parameters Exon 1 sequences of HOX Globin Domain 4 Multiple Alignments Problems Goal

Multiple Alignments Multiple Alignments Problems Goal direct alignment of more then three sequences via dynamic programming not practicable using of heuristics to reduce complexity using approximate methods: progressive, iterative, statistical Typical framework of progressive alignment algorithms 1 determine pairwise distances of all sequences 2 calculate phylogenetic tree from the pairwise distances 3 calculate sequence weights according to their relationship 4 pairwise align sequences sequentially guided by tree

Multiple Alignments Problems Goal Problems with progressive alignments Problems not guaranteed to find optimal alignment ultimate alignment depends on calculated phylogenetic tree ultimate alignment depends on early alignment steps introduced gaps remain fixed during whole progressive alignment process loss of information when building up alignment a ga ga a ga ca a ga

Multiple Alignments Problems Goal Goal Goal increase information transfer from sequence to alignment NNAlign improve quality of introduced gaps find a more accurate description of underlying phylogenetic history progressive alignment method similar to ClustalW aligns both nucleic acid and amino acid sequences instead of aligning two profiles during progressive steps, NNAlign aligns three profiles simultaneously underlying phylogeny is not a tree but a network

Outline 1 Multiple Alignments Problems Goal 2 3 Parameters Exon 1 sequences of HOX Globin Domain 4

The NNAlign method Overview 1 determine sequence distances by pair-wise alignment 2 build a phylogenetic network using Neighbor-Net 3 align sequences sequentially according to phylogenetic network align three sequences in each alignment step while not the final alignment, split up into two alignments Sequence A Alignment AB Sequence B Three Way Alignment Alignment ABC Splitting Process Alignment BC Sequence C

Reconstructing phylogenetic networks with Neighbor-Net 1 Neighbor-Net introduced by D. Bryant and V. Moulton to model reticulate evolution distance based, glomerative method similar to Neighbor Joining pairs nodes not immediately but waits until a node has been paired up a second time after all nodes are glomerated they were expanded result after expansion is a planar splits graph 1 D. Bryant, V. Moulton: Neighbor-Net: An Agglomerative Method for the Construction of Phylogenetic Networks, Mol. Biol. Evol., 21(2), 255-265, 2004

Neighbor-Net: Agglomeration and Expansion a b c a b c f d e f e d a b a b c f f e d e 1 2 c d a b c f d e a b c f d e 3 x y x y b c d b c d 4 x z y 1 x v u 2 w t s r u w Agglomeration Expansion 1 begin with one node for each taxon 2 identify c,d as neighbors as well as e,f 3 identify f as neighbor of a as well as e 4 fusing a,e,f to new nodes x,y 1 expanding nodes y,z to u,w,v 2 expanding nodes v,x to r,s,t

Neighbor-Net example

Alignment order Getting alignment order out of phylogenetic network nodes in Neighbor-Net algorithm correspond to sequences every node fusion corresponds to a three-way alignment order of node fusion gives order of sequential alignments to keep framework consistent, alignment must be splitted up into two alignments (NeighborNet fuses three nodes to two nodes)

Affine gap penalties for pairwise sequence alignments c c c S(x,y) GO (gap open penalty) GE (gap extension penalty) not permitted gap x a a c align a a t gap y consider possibilities for gap opening or extending as finite state machine with three states: align, gap in first sequence, gap in second sequence state transitions are shown in diram

Quasi-natural gap costs for three sequences a c c c c c a c c a a a c a c a c a S(x,y)+S(x,z)+S(y,z) GO+S(a,b) GE+S(a,b) 2*GO 2*GE gap xz gap x not supported gap z align gap xy not all reasonable state transitions possible fast approach, suitable results gap yz gap y

Natural gap costs for three sequences a c c c c c a c c a a a c a c a c a c a a a a a c a c c a c a S(x,y)+S(x,z)+S(y,z) GO+S(a,b) GE+S(a,b) 2*GO 2*GE GO GE gap z gap xz align gap x gap xy all reasonable state transitions are possible higher computational effort necessary gap yz gap y

Adjusting gap penalties Global impacts length of three (groups of) sequences to be aligned length divergence of (groups of) sequences pairwise sequence identities scoring matrix (avere mismatch score) Position specific impacts number of sequences already containing gaps at this position distance from already introduced gaps presence of hydrophilic stretches in protein sequences residue specific gap penalties in protein sequences Sequence weighting not implemented so far!

Divide & Conquer 1 because of cubic complexity in sequence length, very large sequences cannot be aligned en bloc use a divide-and-conquer-recurrence if sequence length exceeds a given limit choice of slicing positions has strong impact on alignment quality 1 J. Stoye: Multiple Sequence Alignment with the Divide-and-Conquer Method, Gene 211(2), GC45-GC56, 1998. (Gene-COMBIS)

Calculating slicing positions Choice of optimal slicing positions c x, c y and c z? Guess: Slicing positions c x, c y and c z should lie on traceback path in dynamic programing algorithm. y j i x yellow: optimal path blue: optimal path running through (i, j) move (i, j) to optimal path Problem: optimal path not known

Additional pairwise cost matrix Definition For each pair of possible slicing positions (0 i S a, 0 j S b ) the additional cost for slicing sequence S a at position i and sequence S b at position j is given by the additional pairwise cost matrix C ab (i, j). y j i x y x darker regions mean higher additional costs optimal traceback path has lowest additional cost

Calculating slicing positions A G T 0 1 2 3 C T 1 1 2 3 2 2 2 2 A G T C T 2 1 1 2 2 1 0 1 3 2 1 0 A G T 0 0 1 3 C T 1 0 0 2 3 2 1 0 f D D b C Algorithm Given an additional score function ω: 1 Calculate for every pair a, b of sequences S x, S y, S z additional cost matrix C ab with: C ab (i, j) = Dab f (i, j) + Db ab (i, j) ωopt ab (0 i S a, 0 j S b ) 2 Set î = S x /2 fixed (assume that sequence x is longest) 3 Find j, k for which C xy (î, j) + C xz (î, k) + C yz (j, k) becomes minimal (0 j S y, 0 k S z )

Alignment splitting X Y Z X Y Z X Y Z XY YZ X Y Y Z Node Fusion Alignment XYZ result of three-way alignment of X, Y and Z split XYZ up into XY and YZ Y contained in both alignments XY and YZ iteratively delete sequences of Y either in XY or in YZ sequentially delete sequences from alignment that gains higher score after deletion

considerations Once per program run Algorithm step Time Space Calculating distances O(l 2 n 2 ) O(l 2 + n 2 ) Determing alignment order O(n 3 ) O(n) For every 3-way alignment step Calculate slicing positions O(n 2 l 2 ) O(l 2 ) D&C-Alignment (D&C) O(n 2 l L 2 ) O(L 3 ) Eliminating duplicates O(n 2 ) O(n) l: avere sequence length n: number of sequences L: divide & conquer length limit

Outline 1 Multiple Alignments Problems Goal 2 3 Parameters Exon 1 sequences of HOX Globin Domain 4 Parameters Exon 1 sequences of HOX Globin Domain

Parameters Exon 1 sequences of HOX Globin Domain Compare NNAlign with ClustalW Parameter settings (default ClustalW settings): gap-open penalty: 10.0 gap-extension penalty: 0.2 Protein scoring matrix: BLOSUM62 terminal gaps are not weighted

Parameters Exon 1 sequences of HOX Globin Domain : Exon 1 sequences of HOX C9/D9 ClustalW: Score=741.57

Parameters Exon 1 sequences of HOX Globin Domain : Exon 1 sequences of HOX C9/D9 NNAlign: Score=776.85 (104.8% of ClustalW)

Parameters Exon 1 sequences of HOX Globin Domain : Exon 1 sequences of HOX C9/D9 NNAlign (quasi-natural gap costs): Score=754.19 (natural gap costs: Score=776.85)

: Globin Domain Parameters Exon 1 sequences of HOX Globin Domain ClustalW: Score=151.89

Parameters Exon 1 sequences of HOX Globin Domain NNAlign: Score=158.38 (104.3% of ClustalW)

Outline 1 Multiple Alignments Problems Goal 2 3 Parameters Exon 1 sequences of HOX Globin Domain 4

and Outlook NNAlign alignments in many cases slightly better score than ClustalW alignment simultaneous alignment of three sequences has benefit natural gap costs increase alignment score compared to quasi-natural gap costs Outlook implement ability for sequence weighting optimize splitting process optimize running time and memory consumption debug...

Thank you for your attention!