Approaches to Efficient Multiple Sequence Alignment and Protein Search

Similar documents
PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Lecture 5: Multiple sequence alignment

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Multiple sequence alignment. November 20, 2018

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

BLAST, Profile, and PSI-BLAST

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

Comparison and Evaluation of Multiple Sequence Alignment Tools In Bininformatics

Lab 4: Multiple Sequence Alignment (MSA)

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

Multiple Sequence Alignment (MSA)

Approaches to Efficient Multiple Sequence Alignment and Protein Search. Adrienn Szabó

Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences

Basic Local Alignment Search Tool (BLAST)

Data Mining Technologies for Bioinformatics Sequences

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Bioinformatics explained: Smith-Waterman

Multiple alignment. Multiple alignment. Computational complexity. Multiple sequence alignment. Consensus. 2 1 sec sec sec

A New Approach For Tree Alignment Based on Local Re-Optimization

Chapter 6. Multiple sequence alignment (week 10)

Special course in Computer Science: Advanced Text Algorithms

3.4 Multiple sequence alignment

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

Bioinformatics for Biologists

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT

Multiple sequence alignment. November 2, 2017

Multiple Sequence Alignment. Mark Whitsitt - NCSA

PoMSA: An Efficient and Precise Position-based Multiple Sequence Alignment Technique

Thesis book. Extension of strongly typed object-oriented systems using metaprograms István Zólyomi

Clustering of Proteins

MULTIPLE SEQUENCE ALIGNMENT SOLUTIONS AND APPLICATIONS

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles

Long Read RNA-seq Mapper

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching,

Project Report on. De novo Peptide Sequencing. Course: Math 574 Gaurav Kulkarni Washington State University

Lecture 5: Markov models

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

INVESTIGATION STUDY: AN INTENSIVE ANALYSIS FOR MSA LEADING METHODS

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

Multiple Sequence Alignment. With thanks to Eric Stone and Steffen Heber, North Carolina State University

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Sequence alignment algorithms

Efficient Inference in Phylogenetic InDel Trees

CS313 Exercise 4 Cover Page Fall 2017

Software review. Shopping in the genome market with EnsMart

Geneious 5.6 Quickstart Manual. Biomatters Ltd

A Modified Vector Space Model for Protein Retrieval

EECS730: Introduction to Bioinformatics

Bio3D: Interactive Tools for Structural Bioinformatics.

Comparison of FP tree and Apriori Algorithm

Mapping Sequence Conservation onto Structures with Chimera

CS 581. Tandy Warnow

Multiple Sequence Alignment Using Reconfigurable Computing

Stephen Scott.

srna Detection Results

Order Preserving Triclustering Algorithm. (Version1.0)

Dependency detection with Bayesian Networks

Introduction to Computer Science

Text Mining. Representation of Text Documents

Biology 644: Bioinformatics

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

Environmental Sample Classification E.S.C., Josh Katz and Kurt Zimmer

PERFORMANCE OF GRID COMPUTING FOR DISTRIBUTED NEURAL NETWORK. Submitted By:Mohnish Malviya & Suny Shekher Pankaj [CSE,7 TH SEM]

User Manual. Ver. 3.0 March 19, 2012

Taxonomically Clustering Organisms Based on the Profiles of Gene Sequences Using PCA

ProDA User Manual. Version 1.0. Algorithm developed by Tu Minh Phuong, Chuong B. Do, Robert C. Edgar, and Serafim Batzoglou.

Integrating Probabilistic Reasoning with Constraint Satisfaction

Introduction to Bioinformatics Online Course: IBT

Technical University of Munich. Exercise 8: Neural Networks

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

A Study On Pair-Wise Local Alignment Of Protein Sequence For Identifying The Structural Similarity

PyMod 2. User s Guide. PyMod 2 Documention (Last updated: 7/11/2016)

Multiple Sequence Alignment II

Skewed-Associative Caches: CS752 Final Project

Fractal Compression. Related Topic Report. Henry Xiao. Queen s University. Kingston, Ontario, Canada. April 2004

Heuristic methods for pairwise alignment:

National Centre for Text Mining NaCTeM. e-science and data mining workshop

Network Traffic Classification by Common Subsequence Finding

Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni Gabaldón

Finding data. HMMER Answer key

mpmorfsdb: A database of Molecular Recognition Features (MoRFs) in membrane proteins. Introduction

MetAmp: a tool for Meta-Amplicon analysis User Manual

PROJECT PERIODIC REPORT

On Generalizations and Improvements to the Shannon-Fano Code

Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships

Evaluating the Effect of Perturbations in Reconstructing Network Topologies

Arbiter: the Evaluation Tool in the Contests of the China NOI

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)

Computational Molecular Biology

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Enhancing Internet Search Engines to Achieve Concept-based Retrieval

Two Phase Evolutionary Method for Multiple Sequence Alignments

Efficient, Scalable, and Provenance-Aware Management of Linked Data

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India

Dynamic Programming: Sequence alignment. CS 466 Saurabh Sinha

Biological Sequence Matching Using Fuzzy Logic

Sequence alignment theory and applications Session 3: BLAST algorithm

Research Article Aligning Sequences by Minimum Description Length

Transcription:

Approaches to Efficient Multiple Sequence Alignment and Protein Search Thesis statements of the PhD dissertation Adrienn Szabó Supervisor: István Miklós Eötvös Loránd University Faculty of Informatics Ph.D. School of Computer Science Erzsébet Csuhaj-Varjú D.Sc. Ph.D. Program: Basics and methodology of Informatics János Demetrovics D.Sc. Budapest, 2016.

Introduction The science of biology offers many great opportunities to a data scientist and a software developer to take part in the challenges of making sense of the gargantuous amount of data being gathered constantly. In my dissertation I focus on two topics: the NP-hard problem of aligning biological sequences [?], and searching protein families in a large database. The latter is more close to a traditional data mining task performed over biological data, while sequence alignment is a very challenging and active research subfield within bioinformatics in itself. After brief introductions to bioinformatics and data mining of the first chapter, Chapter 2 explores previously existing solutions and techniques for aligning proteins and searching protein databases, by reviewing the related scientific literature. Chapter 3 and Chapter 4 introduce new results in the topic of multiple sequence alignment: of the two, the first is based on [RA 1], and it describes how we implemented a new variant of cornercutting methods; while the second introduces a new representation for storing and carrying out statistical computations over a set of multiple sequence alignment paths. The related publications are [ER 1] and [ER 2]. Chapter 5 shows how we tackled the problem of finding a needle in a haystack within a biological context: the task was to find protein families exhibiting some pre-defined complex features in a large protein database. The main goal of the research-and-development projects behind my thesis was to discover new methodologies and techniques that could aid biologist researchers in their daily work in regards of acquiring new insights by automating the process of extracting useful information of the large amounts of data. The developed solutions were tested on real datasets and validated in practice. All the related software codes are accessible as open source projects, and I put great emphasis on re-usability in designing the related codebases. New Results 1. An efficient corner-cutting algorithm for multiple sequence alignment: Reticular Alignment Usually, corner-cutting methods define a compact, mostly convex part of the dynamic programming table for narrowing down the search for the best scored multiple alignment. We introduced a new progressive alignment method called Reticular Alignment, which obtains a set of optimal and suboptimal alignments at each step of the progressive alignment procedure. This set of alignments is 1

represented by a DAG network (a directed acyclic graph) of alignment columns. The novelty of our algorithm, introduced in [RA 1] is that we do not restrict the search for solutions to a compact part of the dynamic programming table, but instead, we use a special data structure (see Fig. 1) for both representing the alignments and aligning the set of alignments against another set of alignments. The common parts of the alignments are represented only once, and aligned only once, thus saving a large amount of memory and running time. Figure 1: A small example of a reticular alignment network: it shows three different alignments of the sequences ALLGVGQ and AVGQ This novel corner-cutting approach allows the efficient search of the space of multiple sequence alignments for better alignments. The method has a parameter, t, which affects how much of the alignment space is explored. The Reticular Alignment method can be combined with any scoring scheme, and in this way, we were able to infer the relative importance of sophisticated scoring schemes versus more exhaustive searches in the alignment space. The conclusion is that it is important to increase the search space for finding high-scored alignments. The accuracy of the final alignments was tested on the BAliBASE database [?]. We compared our method with the most popular alignment tools: ClustalW, MAFFT, and FSA. For a comparison of alignment accuracy, see Figure 2, and for more details, refer to Figure 3.2 in Chapter 3 of the dissertation. We found that combining sophisticated scoring schemes with the Reticular Alignment progressive alignment approach yielded a method whose accuracy is comparable to that of cutting-edge alignment methods, Clustal, MAFFT and FSA. It should be noted that this procedure of explicitly tracing the possible alignment paths as a means of a very strict and selective corner cutting method can not be easily enhanced, unless some dependencies between the alignment columns would be taken into account, but that would require much more data and intelligent algorithms to be trained on it. 2

Figure 2: Comparison of alignment software on BAliBASE v2.01 Refs 1-5. 2. Increasing the reticular threshold parameter in the Reticular Alignment method does not guarantee a better alignment The Reticular Alignment method usually could find more accurate alignments when itstparameter was increased to explore more of the space of possible alignments; although we did find some interesting cases when the final MSA score decreased with a higher value oft, see Figure 3. There can be be two fundamentally different reasons for why a widened search may result in worse final alignments: either the better scored alignments are less accurate (the wider search found better scored alignments, but these alignments happen to be differing more from the BAliBASE benchmark) or because of the local decisions at the guide tree nodes it may happen that more noise (meaning wrong sub-alignments) are propagated upwards, and a locally seemingly better alignment can be worse at the final alignment at the root. For a detailed explanation, see Fig. 4. 3

Figure 3: Dependency of alignment accuracy (all-column SP) on the reticular threshold Alignment accuracy achieved by RetAlign for different reticular threshold values. Accuracy here is measured as the mean all-column SP score on BAliBASE v2.01 Reference sets 1 5. RetAlign was run with sequence weighting on and pairwise indel scoring. 3. A new representation for storing and carrying out statistical computations over a set of multiple sequence alignment paths Representing a set of alignments as a DAG (directed acyclic graph) over alignment columns as nodes provides a way to quickly compute a distribution of alignment paths in the space of possible multiple sequence alignments, and other quantities of interest averaged over alignments [ER 1]. Joining a large collection of alignment paths in our experiments, up to a couple of thousands of them into a DAG network allows for downstream inference to be averaged over a substantive sample of alignments. Due to interchanges and crossovers at the common alignment columns, the number of alignments encoded in the network is usually a couple of orders of magnitude larger than the number of original alignment paths used to generate the DAG, so the effective sample size is greatly increased. It is also possible to weight each alignment according to a more reliable estimate of the posterior probability, rather than analysing only a small set of individual samples. Note, that many standard algorithms that were designed to work on single alignments for 4

Figure 4: Explaining how the internal score might decrease with the reticular threshold. In this particular example s a is the score of the best alignment at internal node v a, with t 1 as the threshold value. At an upper level, namely, at node v b, an x b,1 -network is generated (score: s b,1 ) based on the x a,1 -network. Then the best alignment at the root has score s r,1. After increasing the value of t to t 2, the best alignment at node v a is still the same, but the network of suboptimal alignments to consider at higher levels is the widerx a,2 > x a,1 network. From this network, it is possible to find a better scored alignment at node v b, with score s b,2 > s b,1. If s b,2 x b,2 > s b,1 then the x b,2 -network might be so different from x b,1, that they do not contain any common alignments. Therefore, the best alignment at the root obtained from thex b,2 -network might have a scores r,2 < s r,1. example, forward-backward algorithms for HMMs can be fairly easily adapted so that they can handle alignment DAGs as well. In [ER 1] and [ER 2], we presented a general framework for dealing with alignment uncertainty: on the one hand we explained the statistical background, while on the other hand, we tested our ideas on real data, by implementing the framework in Java as part of the WeaveAlign package. 4. A new protocol for finding re-occurring motifs in disordered regions of proteins Suspecting a role in the development of Alzheimer s disease, we were looking for examples of non-position specific conservation of an amino acid pattern (HD) within a certain distance of 5

transmembrane (TM) domains of TM proteins. I worked with the UniProt/SwissProt database, containing over half a million of protein sequences and their annotations. The proposed protocol consist of the following, individually parameterizable steps, or levels: 1. Filtering transmembrane proteins 2. Cutting TM and extracellular fragments 3. Clustering protein fragments 4. Multiple sequence alignment of protein fragments and tree building for each cluster 5. Finding subtrees of the cluster-trees that include the HD pattern 6. Visualization and analysis of results I wrote two Java packages, the first of which (ProteinSearch) contains classes to handle the SwissProt files and perform the filtering, cutting and clustering of proteins; while the second (PhyTreeSearch) gives an easy-to-use solution to the problem of searching subtrees of a large evolutionary tree that adhere to some pre-defined properties (in terms of containing an amino acid pattern in the sequences, for example). Scripts and configuration files for running the Java programs according to the above pipeline can be found in a separate, third code repository (APP-HD-pattern-runner). By running the pipeline, I managed to find a handful of small sets of proteins that could be interesting to look at from this point on, a biologist needs to decide if they are worth to be further analyzed. This project is still work-in-progress after proving the biological relevance of the findings, we plan to submit a methodology paper to BMC Bioinformatics in 2016. Summary, availability of software Ideally, all software tools used and developed during a research process should be open software. Usually at least some new code fragments or scripts have to be written to carry out new research experiments. Our Java software packages runnable jar files and java sources for multiple sequence alignment methods of Chapters 3 and 4 are openly available at the following URLs: http://phylogeny-cafe.elte.hu/retalign/retalign-0.22a.zip http://statalign.github.io/weavealign/downloads.html https://github.com/statalign/weavealign In Chapter 5, while implementing a pipeline for filtering protein fragments, I used two complementing methodologies / tools to ensure easy reproducibility of the final results: 6

I shared my working environment through a Docker virtual image I published all the software codes (java program sources and scripts, along with example configuration files) on GitHub, as open source projects under Apache License 2.0 1 and Unilicense 2 : https://github.com/ador/app-hd-pattern-runner https://github.com/ador/proteinpatternsearch https://github.com/ador/phytreesearch 1 http://www.apache.org/licenses/license-2.0 2 http://unlicense.org 7

Author s related publications [RA 1] Adrienn Szabó, Ádám Novák, István Miklós, and Jotun Hein: Reticular alignment: A progressive corner-cutting method for multiple sequence alignment BMC Bioinformatics (IF: 3.028), vol 11, page 570, 2010 URL: http://www.biomedcentral.com/1471-2105/11/570 DOI: 10.1186/1471-2105-11-570 [ER 1] Joseph L. Herman, Ádám Novák, Rune Lyngsø, Adrienn Szabó, István Miklós, and Jotun Hein: Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs BMC Bioinformatics (IF: 2.67), vol 16, number 1, 2015 URL: http://dx.doi.org/10.1186/s12859-015-0516-1 DOI: 10.1186/s12859-015-0516-1 [ER 2] Joseph L. Herman, Adrienn Szabó, István Miklós, and Jotun Hein: Approximate statistical alignment by iterative sampling of substitution matrices arxiv, 2015 URL: http://arxiv.org/abs/1501.04986 8