Project Report on De novo Peptide Sequencing
Course: Math 574
Gaurav Kulkarni
Washington State University


Introduction

Proteins are fundamental building blocks of the body. Many biological processes involve proteins, and many functions of the body depend on them directly. The behavior of a protein is related to its structure and its constituent amino acids. Hence, understanding the structure of a protein is an important step in the study of molecular interactions. A protein can be represented as a sequence of characters, each character corresponding to an amino acid. The protein identification problem is therefore: given an unknown protein, find its amino acid sequence. [3]

Tandem Mass Spectrometry

Nearly every approach to identifying a protein sequence uses tandem mass spectrometry. The process takes an unknown peptide as input, makes multiple copies of it, and breaks the copies into fragments. It then weighs the fragments, measures the abundance of each, and plots abundance against mass-to-charge ratio. It also outputs the total mass of the parent peptide. The key idea behind tandem mass spectrometry is to produce all possible prefixes and suffixes of the given peptide. A typical spectrum produced by the spectrometer is shown in Figure 1. [1]

Figure 1. A typical tandem mass spectrum (abundance versus mass-to-charge ratio).

Related Work

There are two main approaches to identifying a protein sequence:

1. Database search
2. De novo sequencing

Both approaches use tandem mass spectrometry.

The database search approach uses a database of known protein sequences. A model spectrum is generated for each candidate peptide and matched against the spectrum of the experimental unknown peptide. Each match is scored with a scoring function, and the candidate peptide with the maximum score is selected. [2] Examples of database search tools include BLAST and SEQUEST. These techniques are highly dependent on the database and fail to identify novel proteins, because the currently available databases are not comprehensive.

In the de novo sequencing approach, the experimental spectrum is converted into a spectrum graph. Each peak in the spectrum is represented by one or more nodes, and an edge is drawn between two nodes if their mass difference equals the mass of one or several amino acids. A path is then found in this graph, for which various algorithms have been proposed.

In the SeqMS algorithm, a set of possible ions is determined for each peak. Each of these ions is represented as a node in the graph, and an edge is drawn if the masses of two nodes differ by the mass of one or several amino acids. An algorithm such as Dijkstra's can then be used to find a complete path from the N terminal to the C terminal. [4]

The Sherenga algorithm uses ion types learnt from a training set, with an offset $\delta_i$ for each ion type $i$. For each peak in the experimental spectrum, $k$ vertices are drawn at offsets $\delta_1, \delta_2, \ldots, \delta_k$, representing the $k$ ion types. Two vertices are connected if their mass difference equals the mass of an amino acid. The peptide identification problem is then reduced to a longest path problem. [4]

The solution obtained by either of these algorithms may contain two or more nodes corresponding to the same peak. [4]

Dynamic Algorithm for Ideal De novo Sequencing

This algorithm uses a dynamic programming strategy for de novo sequencing. Ideal de novo sequencing assumes that the input does not contain any noise. The algorithm first converts the experimental spectrum into an NC-spectrum graph, in which each peak corresponds to two nodes: one for the possibility that the peak is a prefix and one for the possibility that it is a suffix.
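The node construction described above can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not from the report: peak masses are treated directly as prefix or suffix residue sums (no ion-type offsets), masses are nominal integers, edges are restricted to a single amino acid, and the names `RESIDUES`, `nc_spectrum_nodes` and `single_residue_edges` are hypothetical.

```python
from typing import Dict, List, Tuple

# Nominal residue masses for a toy three-letter alphabet (assumption for brevity).
RESIDUES: Dict[str, int] = {"G": 57, "A": 71, "S": 87}


def nc_spectrum_nodes(peaks: List[int], total_mass: int) -> List[Tuple[int, str]]:
    """Return (mass, label) nodes sorted by mass.

    Toy model: a peak of mass m yields node N_j at m (peak read as a prefix)
    and node C_j at total_mass - m (peak read as a suffix).
    """
    nodes = [(0, "N0"), (total_mass, "C0")]
    for j, m in enumerate(peaks, start=1):
        nodes.append((m, f"N{j}"))
        nodes.append((total_mass - m, f"C{j}"))
    return sorted(nodes)


def single_residue_edges(nodes: List[Tuple[int, str]]) -> List[Tuple[str, str, str]]:
    """List edges (from, to, residue) whose mass difference equals one residue mass."""
    by_mass = {mass: name for name, mass in RESIDUES.items()}
    edges = []
    for mass_a, label_a in nodes:
        for mass_b, label_b in nodes:
            if mass_b - mass_a in by_mass:
                edges.append((label_a, label_b, by_mass[mass_b - mass_a]))
    return edges


# Peptide "GAS" has total residue mass 215; suppose the spectrum holds
# the prefix fragment G (57) and the suffix fragment S (87).
nodes = nc_spectrum_nodes([57, 87], 215)
edges = single_residue_edges(nodes)
print([label for _, label in nodes])
print(edges)
```

In this toy input the path N0, N1, C2, C0 spells G, A, S and uses exactly one node from each peak.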
With the nodes sorted by mass, the first half of the graph is then renamed $x_0, x_1, \ldots, x_k$ and the second half is renamed $y_k, y_{k-1}, \ldots, y_0$. The problem is then reformulated as finding a feasible path in the graph, where a feasible path is a path from $N_0$ to $C_0$ that goes through exactly one node of each pair (either $N_j$ or $C_j$). [5]

The feasible path is found using a matrix $M$, where $M(i, j) = 1$ if and only if the graph contains a path $L$ from $x_0$ to $x_i$ and a path $R$ from $y_j$ to $y_0$ such that $L \cup R$ contains exactly one of $x_p$ and $y_p$ for every $p \in [1, i] \cup [1, j]$; otherwise $M(i, j) = 0$. [1]

To construct the edges, the algorithm uses a preprocessed mass array $A$, which takes a mass as input and reports whether that mass equals the total mass of one or more amino acids. The algorithm assumes that the mass of a fragment falls in a specific range. Below, $E(u, v) = 1$ if the graph has an edge from $u$ to $v$, and $0$ otherwise.

The algorithm for finding the matrix $M$ is:

1. Initialize $M(0, 0) = 1$ and $M(i, j) = 0$ for all $i \neq 0$ or $j \neq 0$;
2. Compute $M(1, 0)$ and $M(0, 1)$;
3. For $j = 2$ to $k$
4.   For $i = 0$ to $j - 2$
     (a) if $M(i, j-1) = 1$ and $E(x_i, x_j) = 1$, then $M(j, j-1) = 1$;
     (b) if $M(i, j-1) = 1$ and $E(y_j, y_{j-1}) = 1$, then $M(i, j) = 1$;
     (c) if $M(j-1, i) = 1$ and $E(x_{j-1}, x_j) = 1$, then $M(j, i) = 1$;
     (d) if $M(j-1, i) = 1$ and $E(y_j, y_i) = 1$, then $M(j-1, j) = 1$. [1]
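The steps above can be sketched in Python as follows. This is an illustrative sketch, not the implementation from [1]: it assumes nominal integer masses, a toy residue alphabet, and peak masses that are already prefix or suffix residue sums. The helper names (`mass_array`, `relabel`, `compute_m`, `has_feasible_path`) are hypothetical, and step 2 is read as $M(1, 0) = E(x_0, x_1)$ and $M(0, 1) = E(y_1, y_0)$.

```python
from typing import Callable, List, Sequence, Tuple


def mass_array(residues: Sequence[int], limit: int) -> List[bool]:
    """A[m] is True if m is the total mass of one or more residues."""
    ok = [False] * (limit + 1)
    ok[0] = True  # empty sum, used only as the base case of the recurrence
    for m in range(1, limit + 1):
        ok[m] = any(r <= m and ok[m - r] for r in residues)
    ok[0] = False  # an edge must span at least one amino acid
    return ok


def relabel(peaks: Sequence[int], total: int) -> Tuple[List[int], List[int]]:
    """Return node masses x[0..k] (lower half) and y[0..k] (upper half)."""
    lower = sorted(min(m, total - m) for m in peaks)
    x = [0] + lower
    y = [total] + [total - m for m in lower]
    return x, y


def compute_m(x: List[int], y: List[int],
              edge: Callable[[int, int], bool]) -> List[List[int]]:
    """Fill M(i, j) following steps 1 to 4 of the listing."""
    k = len(x) - 1
    M = [[0] * (k + 1) for _ in range(k + 1)]
    M[0][0] = 1                                   # step 1
    if k >= 1:                                    # step 2
        M[1][0] = int(edge(x[0], x[1]))
        M[0][1] = int(edge(y[1], y[0]))
    for j in range(2, k + 1):                     # step 3
        for i in range(0, j - 1):                 # step 4: i = 0 .. j-2
            if M[i][j - 1] and edge(x[i], x[j]):
                M[j][j - 1] = 1                   # (a)
            if M[i][j - 1] and edge(y[j], y[j - 1]):
                M[i][j] = 1                       # (b)
            if M[j - 1][i] and edge(x[j - 1], x[j]):
                M[j][i] = 1                       # (c)
            if M[j - 1][i] and edge(y[j], y[i]):
                M[j - 1][j] = 1                   # (d)
    return M


def has_feasible_path(x: List[int], y: List[int],
                      edge: Callable[[int, int], bool]) -> bool:
    """A feasible path exists if some M(i, j) with max(i, j) = k closes via E(x_i, y_j)."""
    k = len(x) - 1
    M = compute_m(x, y, edge)
    return any(M[i][j] and edge(x[i], y[j])
               for i in range(k + 1) for j in range(k + 1) if max(i, j) == k)


A = mass_array([57, 71, 87], 215)                 # toy alphabet G, A, S


def edge(a: int, b: int) -> bool:
    return b > a and A[b - a]


x, y = relabel([57, 87], 215)                     # fragments of the peptide "GAS"
print(compute_m(x, y, edge))
print(has_feasible_path(x, y, edge))
```

For the toy input both $M(2, 1)$ and $M(1, 2)$ are set, which reflects that the two peaks can be read either as G-prefix and S-suffix or the other way round.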

To make sure that the feasible path contains only one node corresponding to each peak, the feasible solution is built using the matrix $M$. Assume that the feasible path contains $x_k$. Then the last row of the matrix is searched for an entry that satisfies both $M(k, j) = 1$ and $E(x_k, y_j) = 1$. If $j = k - 1$, the search runs from $i = k - 2$ down to $0$ until both $E(x_i, x_k) = 1$ and $M(i, j) = 1$ are satisfied; otherwise, if $j < k - 1$, then $E(x_{k-1}, x_k) = 1$ and $M(k-1, j) = 1$. This process is repeated to find every edge in the feasible solution. [1] A similar process holds for a solution containing node $y_k$. [1]

The algorithm takes $O(|V|^2)$ time to construct the matrix $M$ and $O(|V|)$ time to find a feasible solution.

Improved Algorithm for Ideal De novo Sequencing

The improved algorithm uses two arrays, lce(.) and dia(.). The entry lce(i) stores the length of the longest run of consecutive edges starting from node $i$. The array dia(.) is defined as:

  $\mathrm{dia}(x_j) = M(j, j-1)$ for $0 < j \le k$;
  $\mathrm{dia}(y_j) = M(j-1, j)$ for $0 < j \le k$;
  $\mathrm{dia}(x_0) = \mathrm{dia}(y_0) = 1$. [1]

Without loss of generality, assume $i < j$. If $i = j - 1$, then $M(i, j) = \mathrm{dia}(y_j)$. If $i < j - 1$, then $M(i, j) = 1$ if and only if $M(i, i+1) = 1$ and $E(y_j, y_{j-1}) = \cdots = E(y_{i+2}, y_{i+1}) = 1$, which is equivalent to $\mathrm{dia}(y_{i+1}) = 1$ and $\mathrm{lce}(y_j) \ge j - i - 1$. Thus both cases can be solved in $O(1)$ time. [1] The time required to construct the matrix $M$ is $O(|V|)$.

Algorithm for Real World Peptide Sequencing

The algorithm can be extended to the case of an experimental spectrum that contains noise. In this case, the edges are scored using a scoring function, and the feasible path with the maximum score is chosen as the solution. The scoring function can take into account factors such as the deviation of the mass difference from the mass of an amino acid and the abundance of the peak, to name a few.
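As an illustration of the two factors named above, the following sketch scores a single edge from the mass deviation and the abundance of the destination peak. The formula, the tolerance value, and the names `RESIDUE_MASS` and `score_edge` are assumptions made for this example only; they do not describe the scoring function of [1].

```python
import math
from typing import Optional, Tuple

# Monoisotopic residue masses for a small subset of amino acids.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203}


def score_edge(mass_from: float, mass_to: float, abundance_to: float,
               tolerance: float = 0.5) -> Optional[Tuple[float, str]]:
    """Return (score, residue) for an edge, or None if no residue fits.

    Hypothetical scoring: the log of the abundance of the destination peak,
    scaled down linearly as the mass difference drifts away from the
    closest residue mass, and rejected beyond `tolerance`.
    """
    diff = mass_to - mass_from
    residue, target = min(RESIDUE_MASS.items(), key=lambda kv: abs(kv[1] - diff))
    deviation = abs(target - diff)
    if deviation > tolerance:
        return None
    return math.log1p(abundance_to) * (1.0 - deviation / tolerance), residue


exact = score_edge(0.0, 57.02146, 100.0)     # exact match for G
noisy = score_edge(100.0, 171.2, 100.0)      # near A, off by about 0.16
missing = score_edge(0.0, 60.0, 100.0)       # no residue within tolerance
print(exact, noisy, missing)
```

With edge scores of this kind, the feasible path would be chosen by maximizing the sum of its edge scores instead of testing only for existence.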
Algorithm for One-Amino Acid Modification

In most cases a protein is digested into multiple peptides, and most of those peptides undergo at most one amino acid modification. The proposed algorithm can be used to identify this amino acid modification as well. The one-amino acid modification problem is equivalent to the following problem: given $G = (V, E)$, find two nodes $v_i$ and $v_j$ such that $E(v_i, v_j) = 0$ but adding the edge $(v_i, v_j)$ to $G$ creates a feasible solution that contains this edge. [1]

Strengths of the Algorithm

The algorithm builds a sequence directly from the experimental spectrum. Because it does not depend on a particular database, it can be used to find unknown or novel sequences.

The algorithm guarantees that only one node is selected for each peak. Hence, it solves the problem faced by earlier algorithms such as Sherenga and SeqMS.

The algorithm can be used as a validation tool alongside the database search approach.

The algorithm can handle post-translational modifications as well. For this, the scoring function can be modified to account for the modifications.

Drawbacks of the Algorithm

The algorithm is sensitive to noise and requires accurate input data. If the input misses any fragment, the output may be wrong.

The algorithm is highly spectrometer specific. Each available spectrometer has a different level of accuracy and produces different types of fragments. The de novo approach fails to take these differences into account.

The paper does not elaborate on the scoring function used. The output of the algorithm depends heavily on the scoring function, because a real world spectrum usually contains noise.

While building the NC-spectrum graph, the algorithm draws an edge between two nodes if their mass difference equals the mass of one or several amino acids. The algorithm assumes that the mass difference lies within a fixed range, and is therefore able to decide on an edge in $O(1)$ time using a preprocessed mass array. In reality, the mass difference can fall outside that range, in which case deciding on an edge takes polynomial time.

Future Work

Efforts can be concentrated on the scoring function used to score an edge. An effective scoring function should handle post-translational modifications as well as noisy input. The scoring function can be extended further, to make the algorithm generic, by using a different scoring function for each spectrometer.

Conclusion

There are two main approaches to identifying an unknown peptide.
The database search approach is more accurate than the de novo sequencing approach, but it does not handle post-translational modifications efficiently, and it fails to identify novel sequences. De novo sequencing, on the other hand, can handle post-translational modifications and can identify novel proteins, but it requires high quality input data and depends on the spectrometer used. Hence, this approach is useful in cases where the input data is highly accurate.

References

1. T. Chen et al., A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry, Journal of Computational Biology 2001; 8:325-37.
2. B. Webb-Robertson and W. Cannon, Current trends in computational inference from mass spectrometry-based proteomics, bioinformatics, June 2007.
3. C. Oehmen, ScalaBLAST: A Scalable Implementation of BLAST for High-performance Data-Intensive Bioinformatics Analysis, IEEE Transactions on Parallel and Distributed Systems, Vol. 17, No. 8, August 2006.
4. B. Lu and T. Chen, Algorithms for de novo peptide sequencing using tandem mass spectrometry, BIOSILICO Vol. 2, No. 2, March 2004.
5. K. Chao, Slides on dynamic programming approach for De novo sequencing, (http://www.google.com/url?sa=t&ct=res&cd=1&url=http%3a%2f%2fwww.csie.ntu.edu.tw%2f~kmchao%2fseq04spr%2fde%2520novo%2520peptide%2520sequencing_v3.ppt&ei=2b0aspifgoespwtxsfn6ca&usg=afqjcnfuhxetu4c4cmlailqarnteh5kbGg&sig2=xMA6oMsJObq5dgfoymjCkw)