Correlogram-based method for comparing biological sequences


Debasis Mitra*, Gandhali Samant* and Kuntal Sengupta+

* Department of Computer Sciences, Florida Institute of Technology, Melbourne, Florida, USA. {dmitra, gsamant}@fit.edu
+ Authentec Corporation, Melbourne, Florida, USA

Abstract. In this article we propose an abstract representation of a sequence using a constant-sized 3D matrix. The representation may subsequently be used for many analytical purposes. We use it for comparing sequences and analyze the method's asymptotic complexity. Providing a metric for sequence comparison is an underlying operation in many bioinformatics applications. To show the effectiveness of the proposed sequence comparison technique, we generated phylogenies over two sets of bio-sequences and compared them with the ones available in the literature. The results indicate that our technique is comparable to the standard ones. The technique, called the correlogram-based method, is borrowed from the image analysis area. We also ran experiments with synthetically generated sequences in order to compare the correlogram-based method with the well-known dynamic programming method. Finally, we discuss other ways in which our method can be used or extended.

1. Introduction

Sequence comparison is one of the most fundamental operations in many bioinformatics problems. For this reason many sequence comparison techniques have been developed in the literature, sometimes targeting specific problems. In this article we propose a novel comparison method and show a few of its uses. Possibly the most accurate sequence comparison technique is the dynamic programming algorithm of Smith and Waterman [1981]. The primary objective of this algorithm is to align two sequences optimally.
When the objective is to obtain a distance or a similarity value between two sequences, global alignment provides a mechanism to achieve that. The value of the optimizing function is then typically used as the similarity parameter. Another popular and efficient sequence comparison method is the BLAST algorithm [Altschul, 1990]. However, its primary purpose is to find homologous longest common sub-sequences between two bio-sequences. BLAST is a problem-specific algorithm and is not a competitor to our proposed method.

Our proposed technique is based on a similar method introduced for comparing two images [Huang et al, 1999]. While an image is a two-dimensional organization of pixels, a bio-sequence is a one-dimensional organization of characters from a finite alphabet (typically nucleic acids or amino acids). We create a mathematical representation of a sequence and subsequently use that representation for comparison. In the following sections we describe the method (section 2) and some experiments on using it for sequence comparison (section 3). We conclude with a discussion of a few other possibilities for using the correlogram representation of sequences (section 4).

The concept of a correlogram has been used in bioinformatics before. Macchiato et al. [1995] used correlograms to analyze autocorrelation characteristics of active polypeptides. Correlograms have also been used for analyzing spatial patterns in various experiments, e.g., Bertorelle et al [1995] used correlograms to study DNA diversity. Rosenberg et al [2003] used correlograms in their studies of patterns of transitional mutation biases within and among mammalian genomes. However, the representation has not been used for sequence comparison before.

2. Correlogram and its usage in sequence comparison

2.1. Correlogram of a sequence

Let a sequence be $s = a_1 a_2 \ldots a_n$, where $|s| = n$ and, for all $i$, $a_i \in \Sigma$. Here $\Sigma$ is the finite alphabet of nucleic acids ({A, T, G, C} for DNA or {A, U, G, C} for RNA) or of the twenty amino acids for a protein. Let $|\Sigma| = m$ ($m$ is 4 or 20).

Definition 1: A correlogram for $s$ is a 3-dimensional matrix of size ($m \times m \times d$), where $0 < d < n$ is a predefined integer (typically between 4 and 7). The first two dimensions of the matrix both represent the symbols of $\Sigma$, and the third dimension is over the integer index $i$, $0 \le i \le d$.
For $x, y \in \Sigma$ and $0 \le i \le d$, let $Freq_s(x, y, i)$ be the frequency of occurrence of the pair $(x, y)$ at a distance $i$ on the sequence $s$. Each entry of the correlogram matrix for the sequence $s$, $Corr_s(x, y, i)$, is the normalized frequency $Freq_s(x, y, i) / N$, where $N = (n - i)$ is the total number of pairs in the sequence at a distance $i$. The normalization compensates for the sequence length, so that sequences of different lengths can subsequently be compared. Without it, longer sequences would tend to have higher frequencies for each pair $(x, y)$.

Figure 1: A shell of a correlogram over {A, T, G, C} and d = 3. Two axes range over the alphabet and the third over the distance i.

Example 1: With $\Sigma$ = {A, C, G, T}, let S = AGCTTAGTCT. Figure 2 shows the plane $Corr_S(x, y, 1)$ for $i = 1$: each entry is the count $Freq_S(x, y, 1)$ divided by $N = n - 1 = 9$. Rows give the first character x and columns the second character y; pairs that do not occur have the entry 0.

         A      T      G      C
    A    0      0     2/9     0
    T   1/9    1/9     0     1/9
    G    0     1/9     0     1/9
    C    0     2/9     0      0

Figure 2: A layer of a sample correlogram for S

Note that the list of distances corresponding to the planes of a correlogram need not be all the integers between 0 and d ($0 \le i \le d$). Rather, it could be a predetermined finite set of integers, each less than n, the length of the sequence. For example, the set of values of i could be {0, 3, 5, 7, 11}. The particular application determines this list.
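The counts in Example 1 can be checked mechanically. The following is a minimal sketch, not the authors' code: the helper name `pair_frequencies` is introduced here for illustration only, and exact fractions are used so the entries can be compared directly with Figure 2.

```python
from collections import Counter
from fractions import Fraction


def pair_frequencies(seq, i):
    """Count ordered pairs (seq[j], seq[j + i]) at distance i (hypothetical helper)."""
    return Counter((seq[j], seq[j + i]) for j in range(len(seq) - i))


S = "AGCTTAGTCT"
freq = pair_frequencies(S, 1)   # raw counts Freq_S(x, y, 1)
n_pairs = len(S) - 1            # N = n - i with n = 10, i = 1

# Normalized plane Corr_S(x, y, 1); pairs that never occur are simply absent (zero).
corr_plane = {pair: Fraction(count, n_pairs) for pair, count in freq.items()}

for (x, y), value in sorted(corr_plane.items()):
    print(f"Corr_S({x}, {y}, 1) = {value}")
```

Passing a different distance, for example `pair_frequencies(S, 3)`, gives the counts for another plane of the same correlogram.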

The correlogram plane for the distance i = 0 is simply a normalized histogram: it holds the normalized frequencies of occurrence of the characters in the sequence. The corresponding plane (for i = 0) is a 2D diagonal matrix.

2.2. Computing Correlogram

The following algorithm computes a correlogram given an input sequence S and a value d.

Algorithm ComputeCorrelogram (string S, integer d)
// Let S = a_1 a_2 ... a_n, where each a_j is a symbol of the alphabet
(1) for each pair of symbols x, y do
(2)     for i = 0 through d do
(3)         Corr_S(x, y, i) = 0;
(4) for i = 0 through d do
(5)     for integer j = 1 through (n - i) do
(6)         Corr_S(a_j, a_{j+i}, i) = Corr_S(a_j, a_{j+i}, i) + (1/(n - i));
(7) return Corr_S;
End Algorithm.

The complexity of the initialization in lines 1 through 3 is $O(m^2 d)$, where $|\Sigma| = m$. The complexity of the main computing loops in lines 4 through 6 is $O(nd)$. So the total complexity is $O(\max\{m^2 d, nd\})$. For a large sequence $n \gg m$, and hence the complexity is $O(nd)$. Since m is a constant (m = 20 or 4), and so is d, ComputeCorrelogram is a linear algorithm with respect to the sequence length.

2.3. Using Correlograms to Compare Sequences

Once a sequence is transformed into a correlogram, it is possible to measure a distance between the two correlograms corresponding to two sequences.

Definition 2: The distance between two sequences S and T is $l_{st} = L(Corr_S, Corr_T)$, where the function L is one of the standard L-norm distance metrics.

For the $L_0$-norm: $l_{st} = \left( \sum_{x, y \in \Sigma,\ 0 \le i \le d} |Corr_S(x, y, i) - Corr_T(x, y, i)| \right) / (|S| + |T| + 1)$

For the $L_1$-norm: $l_{st} = \left( \sum_{x, y \in \Sigma,\ 0 \le i \le d} [Corr_S(x, y, i) - Corr_T(x, y, i)]^2 \right) / (|S| + |T| + 1)$

Higher order L-norm distance metrics may be defined accordingly. Since both correlogram matrices have the same dimensions, computing any of these L-norm distances is of the order of $O(m^2 d)$. We used the $L_1$-norm for our experiments. The distance measure appears to be a metric, as evidenced in some of our preliminary experiments (not presented here).
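The algorithm of section 2.2 and the distances of Definition 2 can be sketched together in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names and the `power` parameter are introduced here, and the correlogram is stored as a sparse dictionary keyed by (x, y, i) instead of a dense m x m x d matrix, so the explicit zero-initialization step is not needed.

```python
def compute_correlogram(seq, d):
    """Return {(x, y, i): normalized frequency} for distances 0 <= i <= d."""
    n = len(seq)
    if not 0 < d < n:
        raise ValueError("d must satisfy 0 < d < len(seq)")
    corr = {}
    for i in range(d + 1):
        weight = 1.0 / (n - i)          # 1/N, with N = n - i pairs at distance i
        for j in range(n - i):
            key = (seq[j], seq[j + i], i)
            corr[key] = corr.get(key, 0.0) + weight
    return corr


def correlogram_distance(s, t, d, power=2):
    """Distance of Definition 2 (sketch).

    power=1 gives the sum of absolute differences (called the L0-norm in the text),
    power=2 gives the sum of squared differences (called the L1-norm in the text).
    The sum is divided by (|S| + |T| + 1) as in the definition.
    """
    corr_s = compute_correlogram(s, d)
    corr_t = compute_correlogram(t, d)
    keys = set(corr_s) | set(corr_t)    # entries missing from one side count as 0
    total = sum(abs(corr_s.get(k, 0.0) - corr_t.get(k, 0.0)) ** power for k in keys)
    return total / (len(s) + len(t) + 1)


if __name__ == "__main__":
    a = "AGCTTAGTCT"
    b = "AGCTTAGTCA"   # made-up variant of a, for illustration only
    print(correlogram_distance(a, a, d=3))
    print(correlogram_distance(a, b, d=3))
```

The sparse dictionary keeps the cost at O(nd) per sequence without paying the O(m^2 d) initialization, which matters mostly for the 20-letter amino acid alphabet; a dense array would work equally well for DNA.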

3. Experiments

We carried out three sets of experiments to study the effectiveness of the correlogram method for sequence comparison. They are described below.

3.1. Experiments with synthetic data

In this set of experiments we compared our proposed technique with Smith and Waterman's [1981] dynamic programming (DP) method over synthetically generated sequences. In some experiments we start with a target sequence S, deform it systematically to S', and measure the distance (or, with DP, the similarity) between the two sequences ($l_{SS'}$) using the two methods (correlogram and DP). In other experiments we start with two arbitrary sequences S1 and S2, deform one of them (S2) systematically to S2', and measure how the distance between them ($l_{S1S2'}$) changes with the deformation. To the best of our knowledge, such experiments with synthetic sequences have not been done before. For lack of space we provide only sample results from this set [for details see the Tech Report, Samant et al, 2005]. The same conclusion holds over all such experiments: the correlogram method is more sensitive to the deformation of a sequence than the DP method.

Figure 3: Correlogram vs. DP scores against character deletion positions (iterations). X-axis: iterations; Y-axis: scores; series: correlogram score and DP score.

Figure 3 shows the result ($l_{S1S2'}$) obtained by deleting a character from the second sequence S2. The position j of the deleted character is varied systematically over $1 \le j \le n$, for $|S2| = n$. As expected, DP is not very sensitive to the position of the character being deleted, whereas the correlogram method shows some fluctuation as the position progresses over the string S2. A cautionary note: the absolute values produced by the two methods should not be compared, because the two methods measure different things (distances and similarities). Rather, their relative change with respect to the control parameter (the position of deletion in this experiment) should be compared. Our conclusion is that the correlogram method is, overall, more suitable for comparing sequences when character deletion takes place. This is expected, as we create a richer abstract representation (the correlogram) for each of the sequences before comparing them, unlike the DP method.

Figure 4 shows the results of a similar experiment in which a sequence is wrapped around systematically (first by one character, then by two characters, and so on; the iterations on the X-axis of the figure indicate this number), and the distance is measured between the original and the deformed sequence. For example, wrapping around the string AGCTTAGTCT with i = 2 gives CTAGCTTAGT. The higher sensitivity of the correlogram-based method is evident in Figure 4 as well. Other experiments, done with systematic character addition and with reversal of a sequence, led to the same conclusion as the character-deletion experiment.

Figure 4: Correlogram vs. DP scores against systematic circular permutation of the source sequence. X-axis: iterations; Y-axis: score; series: correlogram score and DP score.

3.2. Experiment with Equine influenza virus

In this experiment we used a set of protein sequences that is important for immunity to the horse influenza virus. The protein sequences are products of the hemagglutinin (HA) gene, which has gone through multiple mutations over ten years (1990 through 2000) as the infected horses moved through different parts of the USA. We constructed the phylogeny tree from the distance values generated by the correlogram method and compared the tree with a standard work on the same data available in the literature [Lai et al, 2004]. Figures 5a and 5b show the two trees over the set of viruses (SA90/AF197243, SU90/X68437, LM92/X85087, HK92/L27597, KY91/L39918, KY92/L39917, KY94/L39914, KY95/AF197247, KY96/AF197248, KY97/AF197249, KY98/AF197241, FL93/L39916, FL94/AF197242, AR93/L39913, AR94/AF197245, AR95/AF197244, AR96/AF197246, NY99/AY273167, OK00/AY273168). The first part of each element in this set is the common identifier of the corresponding influenza A virus, and the second part is its accession number in the database (EMBL-EBI, European Bioinformatics Institute). The strings are a few hundred characters long. Lai et al [2004] used GeneTool version 1.1 to generate the distance matrix and then ran the Phylip software (Neighbor-join method) from the University of Washington to draw the tree (Fig 5a). We used the correlogram method to generate the distance matrix and used the same program from the Phylip package (evolution.genetics.washington.edu/phylip.html) for the phylogeny construction. The two trees are small enough to be compared manually. The similarity between the two trees supports the usability of the correlogram method for drawing phylogenies. The minor differences between the two trees call for further investigation of their biological significance.

3.3. Experiment with Parvo-virus

The Parvo-virus family resides in the intestines of higher organisms. These viruses are known to cause illness or death in children and are a focus of medical research. We used the RNA sequences of viruses of this family from different organisms and measured the distances between the sequences using our proposed method.
The sequences are from the set (B19 virus, Bovine parvovirus, Canine parvovirus strain B, Feline panleukopenia virus (strain 193), Murine minute virus (strain MVMI), Porcine parvovirus (strain NADL-2), Raccoon parvovirus, Adeno-associated virus 2, Galleria mellonella densovirus) studied in the literature [Chapman et al, 1993]. The strings are around 5000 characters long. We drew a phylogeny tree from the distance matrix generated over the family. Again, the striking similarity with Chapman et al's tree (Figures 6a and 6b) supports the strength of the correlogram-based method. The two viruses AAV2 and B19 have low sequence similarity and high structural similarity compared to the other viruses. That they come closer to each other in the correlogram-based tree suggests that our proposed method has a stronger capability to classify structures. However, further experimentation is needed to verify this.
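Both phylogeny experiments start from a pairwise distance matrix. A minimal sketch of that step is given below, assuming the squared-difference distance of Definition 2; the function names and the three toy sequences are made up for illustration and are not the viral sequences used in the experiments. Writing the matrix in the input format of a particular tree-building program is left out.

```python
def correlogram(seq, d):
    """Sparse correlogram {(x, y, i): normalized frequency}, 0 <= i <= d."""
    n = len(seq)
    corr = {}
    for i in range(d + 1):
        for j in range(n - i):
            key = (seq[j], seq[j + i], i)
            corr[key] = corr.get(key, 0.0) + 1.0 / (n - i)
    return corr


def distance(corr_s, len_s, corr_t, len_t):
    """Sum of squared differences divided by (|S| + |T| + 1), as in Definition 2."""
    keys = set(corr_s) | set(corr_t)
    total = sum((corr_s.get(k, 0.0) - corr_t.get(k, 0.0)) ** 2 for k in keys)
    return total / (len_s + len_t + 1)


def distance_matrix(sequences, d=4):
    """Return (names, symmetric matrix of pairwise correlogram distances)."""
    names = list(sequences)
    corrs = {name: correlogram(sequences[name], d) for name in names}
    matrix = [[0.0] * len(names) for _ in names]
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            na, nb = names[a], names[b]
            value = distance(corrs[na], len(sequences[na]),
                             corrs[nb], len(sequences[nb]))
            matrix[a][b] = matrix[b][a] = value
    return names, matrix


# Toy input, invented for this sketch only.
toy = {
    "seq_a": "AGCTTAGTCTAGGCTA",
    "seq_b": "AGCTTAGTCTAGGCTT",
    "seq_c": "TTTTGGGGCCCCAAAA",
}
names, matrix = distance_matrix(toy, d=4)
for name, row in zip(names, matrix):
    print(name, " ".join(f"{value:.5f}" for value in row))
```

Each correlogram is computed once and reused, so for k sequences the cost is k correlogram constructions plus k(k-1)/2 comparisons, each independent of the sequence lengths.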

Figure 5: Phylogeny trees of the Horse Influenza HA1: a. Lai et al (2004), b. Correlogram-based

Figure 6: Phylogeny trees of the Parvovirus RNA sequences: a. Chapman et al (1993), b. Correlogram-based

4. Conclusion and pointers

In this research we have proposed a new approach to sequence representation and investigated a few of its capabilities. To fully understand the significance of the representation, many new questions and challenges need to be addressed. Some of them are posed below.

4.1. Information content of a correlogram

Is a correlogram reversible? In other words, can one reconstruct a string given its correlogram? When the distance range of the correlogram matrix is maximal, i.e., $d = |s| - 1$ for a string s, no information is lost and the corresponding correlogram should be reversible. A relevant question is whether there exists a smaller list of integers $i < |s|$ such that a correlogram over this list of i's (refer to the discussion in section 2.1) gives a lossless representation. In that case, the correlogram representation may be used for string encryption.

4.2. Using correlogram for finding patterns

Any sequence comparison method can be used for finding a given pattern P in a longer target string T ($|T| > |P|$). However, the cost of the method could be prohibitive for such pattern finding. For example, the DP method has quadratic time complexity for each comparison (P with each subsequence of T of size $|P|$) while scanning the target sequence. We have implemented a modified version of the correlogram method that scans a target sequence and searches for a pattern in linear time [Samant et al, 2005]. As our experiments indicate that the correlogram method may have potential for finding structurally similar patterns, it may have a significant impact in bioinformatics. For example, protein docking may be expedited by such a candidate pattern-matching pre-scan.

4.3. Extending correlogram with gap handling capability

When a pair of characters, say AG, appears at a distance i = 7 on a sequence (as A ... G), it may appear at a distance 8 or 6 on a corresponding mutated sequence, where a character has been inserted into or deleted from the original sequence. To use the correlogram-based representation for comparing such modified sequences, we extended the basic correlogram technique. In this Gapped-correlogram representation of a sequence, we distribute the frequency counts, with weights, over the adjacent cells of the basic correlogram (along the distance i). Thus, e1*Freq(x, y, i) is added to Freq(x, y, i+1) and to Freq(x, y, i-1), and e2*Freq(x, y, i) is added to Freq(x, y, i+2) and to Freq(x, y, i-2), where 0 < e1, e2 < 1 are the weight factors. For example, e1 and e2 may be 0.5 and 0.25 respectively. This type of weighting can be extended to an arbitrary number of adjacent cells, not just to two cell-distances. The weight vector itself may be normalized as a probability distribution that adds up to 1, e.g., e0 + 2*(e1 + e2) = 1 above, where e0 is the weight factor for Freq(x, y, i) itself, which was e0 = 1 before.
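The weighted spreading described for the gapped correlogram can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function names are introduced here, the weights tuple (e0, e1, e2) defaults to the unnormalized example values (1, 0.5, 0.25), and contributions that would fall outside the range 0..d are simply dropped, which is an assumption because the text does not say how the boundary planes are handled.

```python
from collections import defaultdict


def pair_counts(seq, d):
    """Raw counts Freq(x, y, i) for 0 <= i <= d, keyed by (x, y, i)."""
    counts = defaultdict(float)
    for i in range(d + 1):
        for j in range(len(seq) - i):
            counts[(seq[j], seq[j + i], i)] += 1.0
    return counts


def gapped_counts(seq, d, weights=(1.0, 0.5, 0.25)):
    """Spread each count over neighbouring distance planes.

    weights[0] is e0 (the cell itself), weights[k] is e_k for planes i - k and i + k.
    Assumption: contributions landing outside 0..d are discarded.
    """
    base = pair_counts(seq, d)
    gapped = defaultdict(float)
    for (x, y, i), count in base.items():
        for offset, weight in enumerate(weights):
            targets = (i,) if offset == 0 else (i - offset, i + offset)
            for target in targets:
                if 0 <= target <= d:
                    gapped[(x, y, target)] += weight * count
    return gapped


if __name__ == "__main__":
    g = gapped_counts("AGCTTAGTCT", d=3)
    for key in sorted(k for k in g if k[:2] == ("A", "G")):
        print(key, g[key])
```

To obtain a gapped correlogram comparable across sequence lengths, the spread counts would still need to be normalized, for instance by the number of pairs at each distance as in the basic method; the choice of normalization after spreading is not specified in the text.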

The Gapped-correlogram has obvious biological appeal in sequence comparison. However, our preliminary experiments with gapped-correlograms on the problems addressed here did not show any significant difference from the results obtained using the basic correlograms [Samant et al, 2005]. We suspect that broader experiments may show some impact.

Acknowledgement: Mavis McKenna provided some data and insight for this work.

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool, Journal of Molecular Biology, 215:

Bertorelle, G., and Barbujani, G. (1995) Analysis of DNA diversity by spatial auto correlation, Genetics, 140(2):

Chapman, M. S., and Rossmann, M. G. (1993) Structure, Sequence and Function Correlations among Parvoviruses, Virology, 194(2):

Huang, J., Mitra, M., Zhu, W.J. and Zabih, R. (1999) Image Indexing using color correlograms, International Journal of Computer Vision, 35(3), pp

Lai, A. C.K., Rogers, K. M., Glaser, A., Tudor, L., and Chambers, T. (2004) Alternate circulation of recent equine-2 influenza viruses (H3N8) from two distinct lineages in the United States, Virus Res., Mar 15;100(2):

Macchiato, M. F., Cuomo, V., and Tramontano, A. (1995) Determination of the autocorrelation orders of proteins, Genetics, 140:

Rosenberg, M. S., Subramanian, S., and Kumar, S. (2003) Patterns of Transitional Mutation Biases Within and Among Mammalian Genomes, Mol Biol Evol., Jun;20(6):

Samant, G., and Mitra, D. (2005) Correlogram method for Comparing Bio-Sequences, Florida Institute of Technology Technical Report No. CS ,

Smith, T.F., and Waterman, M.S. (1981) Identification of common molecular sequences, Journal of Molecular Biology, 147:
