Project Report on De novo Peptide Sequencing
Course: Math 574
Gaurav Kulkarni
Washington State University


Introduction

Proteins are fundamental building blocks of the body. Many biological processes involve proteins, and many functions of the body depend on them directly. The behavior of a protein is determined by its structure and its constituent amino acids. Hence, understanding the structure of a protein is an important step in the study of molecular interactions. A protein can be represented as a sequence of characters, each character corresponding to an amino acid. The problem of protein identification is therefore: given an unknown protein, find its amino acid sequence. [3]

Tandem Mass Spectrometry

Almost every approach used to identify a protein sequence relies on tandem mass spectrometry. The process takes an unknown peptide as input, makes multiple copies of it, and breaks the copies into fragments. It then weighs the fragments, measures the abundance of each fragment, and plots abundance against mass-to-charge ratio. It also outputs the total mass of the parent peptide. The key idea behind tandem mass spectrometry is to produce all possible prefixes and suffixes of the given peptide. A typical graph produced by the spectrometer is shown in Figure 1. [1]

Figure 1

Related Work

There are two main approaches to identifying a protein sequence:
1. Database Search Approach
2. De novo Sequencing

Both approaches make use of tandem mass spectrometry. The database search approach relies on a database of known protein sequences. A model spectrum is generated for each candidate peptide and matched against the spectrum of the unknown experimental peptide. Each match is scored using a scoring function, and the candidate peptide with the maximum score is selected. [2] Examples of the database search approach include BLAST and SEQUEST. These techniques are highly dependent on the database and fail to identify novel proteins, because the databases currently available are not comprehensive.

In the de novo sequencing approach, the experimental spectrum is converted into a spectrum graph. Each peak in the spectrum is represented by one or more nodes, and an edge is drawn between two nodes if their mass difference equals the mass of one or several amino acids. A path is then sought in this graph, and various algorithms have been proposed for finding it.

In the SeqMS algorithm, the set of possible ions is determined for each peak. Each of these ions is represented as a node in the graph, and an edge is drawn if the masses of two nodes differ by the mass of one or several amino acids. Algorithms such as Dijkstra's algorithm can be used to find a complete path from the N terminal to the C terminal. [4]

The Sherenga algorithm uses ion types learnt from a training set, together with an offset $\delta_i$ for each ion type $i$. For each peak in the experimental spectrum, $k$ vertices are drawn at offsets $\delta_1, \delta_2, \ldots, \delta_k$, representing the $k$ ion types. Two vertices are connected if their mass difference equals the mass of an amino acid. The peptide identification problem is then reduced to a longest path problem. [4]

The solutions produced by both of these algorithms may contain two or more nodes corresponding to the same peak. [4]

Dynamic Algorithm for ideal De novo Sequencing

This algorithm uses a dynamic programming strategy for de novo sequencing. Ideal de novo sequencing assumes that the input does not contain any noise. The algorithm first converts the experimental spectrum into an NC-spectrum graph, in which each peak corresponds to two nodes, one for the possibility that the peak is a prefix and one for the possibility that it is a suffix.
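The two-nodes-per-peak construction can be illustrated with a minimal Python sketch. It is not taken from the report or from [1]: the function name `nc_spectrum_nodes`, the assumption of singly charged prefix (b) and suffix (y) ions, and the proton and water offsets are all assumptions made for illustration. Every node is expressed as a prefix residue mass so that the two readings of a peak can be placed on one mass axis.

```python
PROTON = 1.00728  # mass of a proton in Da (assumes singly charged ions)
WATER = 18.01056  # mass of H2O in Da


def nc_spectrum_nodes(peaks, parent_mass):
    """Return NC-spectrum graph nodes as (prefix_mass, label), sorted by mass.

    peaks: m/z values of singly charged fragment ions (assumption).
    parent_mass: neutral mass of the peptide (residues plus one water).
    """
    residue_total = parent_mass - WATER
    # N0 and C0 mark the empty prefix and the complete peptide.
    nodes = [(0.0, "N0"), (residue_total, "C0")]
    for j, mz in enumerate(peaks, start=1):
        # Peak read as a prefix ion: strip the proton to get the prefix mass.
        nodes.append((mz - PROTON, f"N{j}"))
        # Peak read as a suffix ion: strip proton and water to get the suffix
        # residue mass, then convert it to the complementary prefix mass.
        suffix_mass = mz - PROTON - WATER
        nodes.append((residue_total - suffix_mass, f"C{j}"))
    return sorted(nodes)


if __name__ == "__main__":
    # One peak from the dipeptide G-A (residue masses 57.02146 and 71.03711).
    for mass, label in nc_spectrum_nodes([58.02874], 146.06913):
        print(f"{label}: {mass:.5f}")
```

Sorting the nodes by mass is what later allows them to be relabelled in order along the mass axis.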
With the nodes ordered by mass, the first half of the graph is renamed $x_0, x_1, \ldots, x_k$ and the second half is renamed $y_k, y_{k-1}, \ldots, y_0$. The problem is then reformulated as finding a feasible path in the graph, where a feasible path is a path from $N_0$ to $C_0$ that goes through exactly one node of each pair (either $N_j$ or $C_j$) [5]. A feasible path is found using a matrix $M(i, j)$, where $M(i, j) = 1$ if and only if the graph contains a path $L$ from $x_0$ to $x_i$ and a path $R$ from $y_j$ to $y_0$ such that $L \cup R$ contains exactly one of $x_p$ and $y_p$ for every $p \in [1, i] \cup [1, j]$; otherwise $M(i, j) = 0$. [1]

To construct the edges, the algorithm uses a preprocessed mass array $A$, which takes a mass as input and reports whether that mass equals the mass of one or more amino acids. The algorithm assumes that the mass of a fragment falls in a specific range. In what follows, $E(u, v) = 1$ means that the graph contains an edge from node $u$ to node $v$.

The algorithm for computing the matrix $M$ is:
1. Initialize $M(0, 0) = 1$ and $M(i, j) = 0$ for all $i \neq 0$ or $j \neq 0$;
2. Compute $M(1, 0)$ and $M(0, 1)$;
3. For $j = 2$ to $k$
4. For $i = 0$ to $j - 2$
   (a) if $M(i, j-1) = 1$ and $E(x_i, x_j) = 1$, then $M(j, j-1) = 1$;
   (b) if $M(i, j-1) = 1$ and $E(y_j, y_{j-1}) = 1$, then $M(i, j) = 1$;
   (c) if $M(j-1, i) = 1$ and $E(x_{j-1}, x_j) = 1$, then $M(j, i) = 1$;
   (d) if $M(j-1, i) = 1$ and $E(y_j, y_i) = 1$, then $M(j-1, j) = 1$. [1]
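The four update rules can be followed more easily in code. The sketch below is a direct Python transcription of steps 1 to 4 under some assumptions: nodes are encoded as hypothetical tuples `("x", i)` and `("y", j)`, and the edge test $E$ is passed in as a callable `has_edge`. The helper `has_feasible_path` is an illustrative addition that closes a path by checking for an edge between the last node on the $x$ side and the first node on the $y$ side.

```python
def compute_m(k, has_edge):
    """Fill the (k+1) x (k+1) matrix M following steps 1-4 of the listing.

    has_edge(u, v) plays the role of E(u, v); nodes are ("x", i) or ("y", j).
    """
    m = [[0] * (k + 1) for _ in range(k + 1)]
    m[0][0] = 1  # step 1
    if k >= 1:  # step 2
        m[1][0] = int(bool(has_edge(("x", 0), ("x", 1))))
        m[0][1] = int(bool(has_edge(("y", 1), ("y", 0))))
    for j in range(2, k + 1):  # step 3
        for i in range(j - 1):  # step 4: i = 0 .. j-2
            if m[i][j - 1] and has_edge(("x", i), ("x", j)):
                m[j][j - 1] = 1  # rule (a)
            if m[i][j - 1] and has_edge(("y", j), ("y", j - 1)):
                m[i][j] = 1  # rule (b)
            if m[j - 1][i] and has_edge(("x", j - 1), ("x", j)):
                m[j][i] = 1  # rule (c)
            if m[j - 1][i] and has_edge(("y", j), ("y", i)):
                m[j - 1][j] = 1  # rule (d)
    return m


def has_feasible_path(k, has_edge):
    """Illustrative helper: True if the two partial paths can be joined."""
    if k == 0:
        return bool(has_edge(("x", 0), ("y", 0)))
    m = compute_m(k, has_edge)
    via_xk = any(m[k][j] and has_edge(("x", k), ("y", j)) for j in range(k))
    via_yk = any(m[i][k] and has_edge(("x", i), ("y", k)) for i in range(k))
    return via_xk or via_yk


if __name__ == "__main__":
    # Toy graph with k = 2 and the single path x0 -> x1 -> y2 -> y0.
    edges = {
        (("x", 0), ("x", 1)),
        (("x", 1), ("y", 2)),
        (("y", 2), ("y", 0)),
    }
    print(has_feasible_path(2, lambda u, v: (u, v) in edges))
```

In the toy graph the path uses $x_1$ for the first pair and $y_2$ for the second, so it visits exactly one node of each pair. A path such as $x_0 \to x_1 \to y_1 \to y_0$ would use both nodes of one pair and is rejected.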

To make sure that the feasible path contains only one node for each peak, the feasible solution is built from the matrix $M$. Assume first that the feasible path contains $x_k$. The last row of the matrix is then searched for a nonzero entry that satisfies both $M(k, j) = 1$ and $E(x_k, y_j) = 1$. If $j = k - 1$, the search runs over $i$ from $k - 2$ down to $0$ until both $E(x_i, x_k) = 1$ and $M(i, j) = 1$ are satisfied; otherwise, if $j < k - 1$, then $E(x_{k-1}, x_k) = 1$ and $M(k-1, j) = 1$. This process is repeated to find every edge in the feasible solution [1]. A similar process holds for a solution containing node $y_k$. [1] The algorithm takes $O(|V|^2)$ time to construct the matrix $M$ and $O(|V|)$ time to find a feasible solution.

Improved Algorithm for ideal De novo Sequencing

The improved algorithm uses two arrays, lce(.) and dia(.). The entry lce($z$) stores the length of the longest run of consecutive edges starting from node $z$. The array dia(.) is defined as:
dia($x_j$) = $M(j, j-1)$ for $0 < j \le k$;
dia($y_j$) = $M(j-1, j)$ for $0 < j \le k$;
dia($x_0$) = dia($y_0$) = 1. [1]

Without loss of generality, one can assume $i < j$. If $i = j - 1$, then $M(i, j)$ = dia($y_j$). If $i < j - 1$, then $M(i, j) = 1$ if and only if $M(i, i+1) = 1$ and $E(y_j, y_{j-1}) = \cdots = E(y_{i+2}, y_{i+1}) = 1$, which is equivalent to dia($y_{i+1}$) = 1 and lce($y_j$) $\ge j - i - 1$. Thus both cases can be answered in $O(1)$ time [1]. The information held in the matrix $M$ can therefore be obtained in $O(|V|)$ time.

Algorithm for real world peptide sequencing

The algorithm can be extended to experimental spectra that contain noise. In this case, the edges are scored using a scoring function, and the feasible path with the maximum score is chosen as the solution. The scoring function can take into account factors such as the deviation of a mass difference from the mass of an amino acid and the abundance of the peak.
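Since the scoring function is not specified, the following Python sketch shows only one possible shape for it. Everything in it is an assumption made for illustration: the name `edge_score`, the Gaussian penalty on mass deviation, the tolerance `sigma`, the restriction to single-residue edges, and the small subset of monoisotopic residue masses. It combines the two factors named above, mass deviation and peak abundance.

```python
import math

# Monoisotopic residue masses in Da for a few amino acids (illustrative subset).
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "V": 99.06841,
    "L": 113.08406,
}


def edge_score(mass_u, mass_v, abundance, sigma=0.02):
    """Hypothetical edge score: abundance damped by the mass deviation.

    The deviation is the distance between the node mass difference and the
    closest residue mass; sigma (Da) is an assumed instrument tolerance.
    """
    diff = mass_v - mass_u
    deviation = min(abs(diff - m) for m in RESIDUE_MASS.values())
    return abundance * math.exp(-0.5 * (deviation / sigma) ** 2)


if __name__ == "__main__":
    print(edge_score(0.0, 57.02146, 1.0))  # exact glycine step
    print(edge_score(0.0, 57.07, 1.0))     # same step with a larger mass error
```

With a score of this kind, an edge whose mass difference matches a residue exactly keeps the full abundance of its peak, and the score decays smoothly as the mass error grows. A real scoring function would need to be tuned to the instrument.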
Algorithm for one-amino acid modification

In most cases a protein is digested into multiple peptides, and most of these peptides undergo at most one amino acid modification. The proposed algorithm can also be used to identify this modification. The one-amino acid modification problem is equivalent to the following problem: given $G = (V, E)$, find two nodes $v_i$ and $v_j$ such that $E(v_i, v_j) = 0$ but adding the edge $(v_i, v_j)$ to $G$ creates a feasible solution that contains this edge. [1]

Strengths of the Algorithm

- The algorithm builds a sequence directly from the experimental spectrum. Because it does not depend on a particular database, it can be used to find unknown or novel sequences.
- The algorithm guarantees that only one node is selected for each peak. Hence, it solves the problem faced by earlier algorithms such as Sherenga and SeqMS.
- The algorithm can be used as a validation tool alongside the database search approach.
- The algorithm can handle post-translational modifications as well, by modifying the scoring function to account for them.

Drawbacks of the algorithm

- The algorithm is sensitive to noise and requires accurate input data. If the input misses any fragment, the output may be wrong.
- The algorithm is highly spectrometer specific. Each available spectrometer has a different level of accuracy and produces different types of fragments. The de novo approach fails to take these differences into account.
- The paper does not elaborate on the scoring function used. The output of the algorithm depends heavily on the scoring function, as a real world spectrum usually contains noise.
- While building the NC-spectrum graph, the algorithm draws an edge between two nodes if their mass difference equals the mass of one or several amino acids. It assumes that the mass difference lies within a fixed range, and can therefore draw an edge in $O(1)$ time using a preprocessed mass array. In reality, the mass difference can fall outside this range, in which case drawing an edge takes polynomial time.

Future work

Efforts can be concentrated on the scoring function used to score an edge. An effective scoring function should handle post-translational modifications as well as noisy input. The scoring function can be extended further, to make the algorithm generic, by using a different scoring function for each spectrometer.

Conclusion

There are two main approaches to identifying an unknown peptide. The database search approach is more accurate than the de novo sequencing approach, but it is not efficient at handling post-translational modifications, and it fails to identify novel sequences. De novo sequencing, on the other hand, can handle post-translational modifications and can identify novel proteins, but it requires high quality input data and depends on the spectrometer used. Hence, this approach is most useful in cases where the input data is highly accurate.

References

1. T. Chen et al., A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry, Journal of Computational Biology 2001; 8:325-37.
2. B. Webb-Robertson and W. Cannon, Current trends in computational inference from mass spectrometry-based proteomics, Briefings in Bioinformatics, June 2007.
3. C. Oehmen, ScalaBLAST: A Scalable Implementation of BLAST for High-performance Data-Intensive Bioinformatics Analysis, IEEE Transactions on Parallel and Distributed Systems, Vol. 17, No. 8, August 2006.
4. B. Lu and T. Chen, Algorithms for de novo peptide sequencing using tandem mass spectrometry, BIOSILICO Vol. 2, No. 2, March 2004.
5. K. Chao, Slides on dynamic programming approach for De novo sequencing, (http://www.google.com/url?sa=t&ct=res&cd=1&url=http%3a%2f%2fwww.csie.ntu.edu.tw%2f~kmchao%2fseq04spr%2fde%2520novo%2520peptide%2520sequencing_v3.ppt&ei=2b0aspifgoespwtxsfn6ca&usg=afqjcnfuhxetu4c4cmlailqarnteh5kbGg&sig2=xMA6oMsJObq5dgfoymjCkw)