Metric Indexing of Protein Databases and Promising Approaches


WDS'07 Proceedings of Contributed Papers, Part I, 91-97, ISBN MATFYZPRESS

D. Hoksza
Charles University, Faculty of Mathematics and Physics, Prague, Czech Republic.

Abstract. The most widely used biological databases nowadays are nucleotide and protein databases. These databases are crucial for determining the biological functions of living organisms with respect to their DNA structure. The biological function of a protein can be derived from its similarity to another protein with a known function that is stored in a database; the chance of finding the biological function of a given protein or DNA sequence therefore grows with the size of the database. Because of this, the growth of these databases is exponential, which in turn calls for sublinear methods of searching them. The optimal solution is to align the query sequence with all sequences in the queried database. Since aligning two sequences is computationally expensive, fast heuristic methods (e.g. BLAST [Altschul et al., 1997]) are used, although they only approximate the optimal solution without bounding the resulting error. In this paper we try to use metric access methods (MAMs) for exact and approximate searching of protein databases. As the experiments show, such a straightforward use of MAMs is not very suitable; we therefore also outline possible further directions in indexing protein sequences, based on what has been learned so far.

Introduction

Databases of protein sequences exist because similar proteins secure similar biological functions, hence it makes sense to store such sequences in a database. This similarity is shared even among different species, and because of this the growth of these databases has been exponential in the last ten years. One way to handle exponentially growing data is to use indexing, since the complexity of searching is then sublinear.
There have been attempts to incorporate indexing techniques into the searching of protein databases, but almost all of these techniques split sequences into q-grams and use distance functions which are easily indexable but neglect the biological meaning of protein similarity (see the Current Indexing Approaches section). Because of that, the most widely used method nowadays is BLAST, a heuristic approach with linear time complexity.

Protein Databases¹

DNA molecules consist of two strands of nucleotides (conventionally labeled A, C, G, T) which can be transcribed to RNA and later translated to proteins, which are linear polymers of 20 types of amino acids. The rule for translating every triplet of nucleotides (codon) into an amino acid is called the genetic code. A combination of proteins secures a biological function, which is determined by the three-dimensional structure of the proteins involved. An important fact is that proteins with similar sequences have similar three-dimensional structures and therefore similar functions².

When the sequence of amino acids of a protein is determined, we usually want to know its function. The easiest way to find it out is to search a database of protein sequences with known functions. This gives us a possible purpose of the examined sequence, which we can use as a clue to its function. To increase the probability of finding a similar protein, the database which is searched should be as large as possible.

¹ We do not consider nucleotide databases in this paper; they are handled similarly (but for different purposes).
² Moreover, more extensive sequence similarity can be viewed as evidence of common ancestry, and therefore as a basis for reconstructing the phylogenetic history of organisms and their genes.

Similarity Search in Sequence Databases

When comparing two sequences³ we can say that they are somehow similar. Intuitively it means that the two sequences have a sufficient number of identical (similar) letters at identical (similar) positions (when gaps may be inserted into both sequences). Hence finding the optimal alignment means inserting gaps into both sequences so that the score of the alignment is as good as possible. Consequently, the score can be further used in retrieval of sequences similar to a query sequence.

Common string distance measures. If we take the distance to be the number of positions at which the letters differ (thus not allowing gaps), we speak about the Hamming distance (HD), which is defined on sequences of equal length. If we loosen this condition we get the Levenshtein (or edit) distance, given by the minimum number of operations needed to transform one sequence into the other, where an operation is an insertion, deletion, or replacement of a single letter. Finally, we can reflect the similarity of letters by equipping each pair of letters with a weight, which gives the weighted edit distance (with a scoring matrix). The values (weights) for each pair of letters are stored in a square scoring matrix. There are many different scoring matrices used in bioinformatics for various purposes (e.g. PAM matrices [Dayhoff et al., 1978], BLOSUM matrices [Henikoff S. and Henikoff J.G., 1992], etc.). The score of an alignment is then given as the sum of the gap penalties and of the scoring-matrix values for the aligned (matching or mismatching) pairs of letters (see Figure 1).

Figure 1. Global and local alignment (and their scores) of protein sequences NPHGIIMGLAE and HGLGL according to the BLOSUM62 scoring matrix.

Global and Local Alignment Measures. We further distinguish global and local alignment.
Both are weighted edit distances, but global alignment aligns whole sequences whilst local alignment globally aligns every pair of substrings of the two sequences, and the best of these alignments is declared to be the value of the local alignment⁴ (see Figure 1). Algorithms for both of these problems are of exponential complexity when all possible alignments are taken into account. For both of them, however, there are dynamic programming solutions which compute the alignments in $O(mn)$ time and space. The algorithm for global alignment was published by Needleman and Wunsch [Needleman and Wunsch, 1970], and in 1981 Smith and Waterman proposed an extension to it [Smith and Waterman, 1981] (SW) which computes the local alignment with the same time and space complexity as the global one.

The core of the solution is an $n \times m$ matrix $s$ where cell $s_{i,j}$ stands for the optimal (here maximal) score which belongs to the prefixes of lengths $i$ and $j$ of the aligned sequences. In the initialization stage the cell $s_{0,0}$ is set to 0, and the 0th row and column are filled with the scores of aligning the empty prefix of one string with the $i$-long prefix of the other (and vice versa). The recursive formula for filling the cells of matrix $s$ when computing the global alignment is:

\[
s_{i,j} = \max \left\{ \begin{array}{l} s_{i-1,j} + \sigma \\ s_{i,j-1} + \sigma \\ s_{i-1,j-1} + \delta(a_i, b_j) \end{array} \right. \tag{1}
\]

where $a$ and $b$ represent the sequences to be aligned, $\sigma$ is the score for a gap and $\delta$ is the scoring matrix. Since $s_{i,j}$ contains the score of the global alignment of the $i$-long prefix of $a$ and the $j$-long prefix of $b$, the cell $s_{|a|,|b|}$ contains the alignment score for the whole sequences.

For computing the local alignment, the only change to the recursive formula is adding 0 to the values over which the maximum is taken. Hence we drop the alignment built so far and start a new local alignment at position $[i,j]$ whenever that improves the score. The score of the optimal local alignment is then the highest value in the matrix.
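The recurrence (1) and its local variant can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `alignment_score`, the toy scoring function `toy_delta` (+1 for a match, -1 for a mismatch) and the gap score of -1 are hypothetical choices, not the BLOSUM62 setting used in the paper, and the local variant assumes that the 0th row and column are initialized to zero.

```python
def toy_delta(x, y):
    """Toy scoring function (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def alignment_score(a, b, delta=toy_delta, sigma=-1, local=False):
    """Optimal alignment score of sequences a and b with a linear gap score sigma.

    local=False follows recurrence (1) (global alignment),
    local=True additionally takes 0 into the maximum (local alignment).
    """
    n, m = len(a), len(b)
    # s[i][j] = best score for the i-long prefix of a and the j-long prefix of b
    s = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: a prefix aligned with the empty prefix costs only gaps.
        for i in range(1, n + 1):
            s[i][0] = i * sigma
        for j in range(1, m + 1):
            s[0][j] = j * sigma
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            value = max(
                s[i - 1][j] + sigma,                           # gap in b
                s[i][j - 1] + sigma,                           # gap in a
                s[i - 1][j - 1] + delta(a[i - 1], b[j - 1]),   # (mis)match
            )
            if local:
                value = max(value, 0)  # restart the alignment at [i, j]
            s[i][j] = value
            best = max(best, value)
    # Local score: highest cell of the matrix; global score: last cell.
    return best if local else s[n][m]
```

For example, with the toy scoring above the global score of "AAA" against "AA" is 1 (two matches and one gap), while the local score of "XXAAAYY" against "ZZAAAWW" is 3, because the local variant is free to ignore the mismatching flanks.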
Finally, there is usually another modification which makes it possible to differentiate between opening and extending a gap. Extending a gap is penalized considerably less than opening one.

³ Multiple sequences can also be mutually aligned for some purposes, but this is out of the scope of this paper.
⁴ In aligning protein sequences, local alignment is used.
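The two unweighted string distances introduced earlier in this section, the Hamming distance and the Levenshtein (edit) distance, can be illustrated by the following minimal sketch. The function names are hypothetical, and the weighted variant with a scoring matrix is deliberately left out.

```python
def hamming(a, b):
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is defined for equal lengths only")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # prev[j] = distance between the processed prefix of a and the j-long prefix of b
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]  # i deletions turn the i-long prefix of a into the empty string
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # replace x by y (free if equal)
            ))
        prev = cur
    return prev[-1]
```

Unlike the alignment scores above, both functions are true metrics, which is why they are easy to index; what they lack is any notion of the biological similarity of individual amino acids.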

Heuristic Approaches to Retrieval of Similar Sequences

At the time the Smith-Waterman algorithm was invented, it was the optimal way of searching protein sequences for similarities. But, as mentioned earlier, the sizes of databases have been growing exponentially since then, and it is no longer feasible to align each database sequence with the query sequence in quadratic time. Therefore heuristic methods have been proposed to reduce the search time. Examples of these methods are FASTA [Pearson and Lipman, 1988] and BLAST (Basic Local Alignment Search Tool) [Altschul et al., 1997]. Since BLAST is nowadays the most widely used method, we describe it here (other heuristic methods work in a similar way):

1) Remove low-complexity regions from the query sequence (those with no meaningful alignment).
2) Generate all n-grams (substrings of length n) of the query sequence.
3) Compute the similarity between every sequence of length n (over the given alphabet) and each n-gram from the previous step.
4) Filter out sequences with similarity lower than a cut-off score (called the neighborhood word score threshold).
5) The remaining high-scoring sequences (organized in a search tree) are then used to search all database sequences for exact matches.
6) High-scoring sequences within a given distance (those on the same diagonal if we imagine the sequences in a matrix) are then connected together with a gapped alignment, and these are extended as long as the score grows⁵. Such alignments are called high scoring pairs (HSP).
7) All HSPs with scores below a given threshold are excluded.
8) The scores of the non-filtered sequences are refined by the classic Smith-Waterman algorithm.

Statistical Relevance. To see how significant a given alignment is, the so-called E-value is defined, which determines whether an alignment happened just by chance or whether it is somehow important.
Thus the E-value is the expected number of alignments of sequences of lengths $m$ and $n$ with a score of at least $S$ in a database with $N$ residues (amino acids), computed as⁶

\[
E = K \, m_{\mathrm{eff}} \, n_{\mathrm{eff}} \, e^{-\lambda S} \, N / n \tag{2}
\]

where $K$ and $\lambda$ are characteristics of the SW score distribution [Karlin and Altschul, 1990] and $m_{\mathrm{eff}}$ and $n_{\mathrm{eff}}$ are effective lengths of the sequences, which compensate for the higher probability of an alignment occurring in the middle of the sequences.

Current Indexing Approaches

Although BLAST shows good speed and accuracy, there is still a 64% slowdown every year (as stated in [Cameron et al., 2004]) because of the exponential growth of the databases which BLAST searches for similarities. This is the reason why there have been efforts to find methods which search for similarities in sublinear time, e.g. by indexing. Next, we show some of these indexing methods (both metric and non-metric).

SSAHA. SSAHA [Ning et al., 2001] is a method that is primarily meant for nucleotide sequences, but the idea could be generalized to protein sequences too. It hashes all q-grams of the stored sequences, where for each q-gram there is one tuple consisting of the index of the sequence and the offset of the q-gram. When searching, the query sequence is split into q-grams and every hit of every such q-gram is converted to a triplet (index, offset - t, offset), where t is the position of the q-gram in the query sequence. The list of these triplets is then sorted according to the first two attributes. Consecutive sections of the list of size n having identical first two attributes are hits of length q·n.

BLAT. A direct improvement of BLAST is BLAT (BLAST-like alignment tool) [Kent, 2002], which uses indexing to improve the efficiency of BLAST. Probably the most important modification is that whereas BLAST uses a search tree for the query sequence and scans the database letter by letter, BLAT does it the other way round.
It organises non-overlapping q-grams of the database sequences into an index, and this index is used to search for similarity with every q-gram of the query sequence. Of course, because of the splitting into q-grams, the method is still heuristic, much like BLAST. BLAT was principally intended for nucleotide sequences.

FT#N(s). FT#N(s) [Ozturk and Ferhatosmanoglu, 2003] is a method that uses a transformation to turn sequences into multidimensional vectors of numbers, which are then indexed with an R-tree (any other indexing method can be used, of course). Distance functions are then defined on the indexed data to provide similarity search among the vectors (i.e. sequences). Frequencies of q-grams are used as vectors (e.g. FT2N(s) for the 2-grams of a sequence s) and the distance is FD#N(s_1, s_2), which is defined as the maximum of differences between each of the dimensions⁷. This distance is proved to be a lower bound of the edit distance between the original strings. With growing length of the q-grams, the length of the transformed vector grows exponentially; therefore [Hsieh et al., 2006] introduced a method which uses similar transformations and distance functions but grows linearly with the number of features it uses instead of q-grams.

⁵ This applies to BLAST2; previous versions of BLAST did not connect sequences on diagonals (and therefore used a higher value for the cut-off score in step 4).
⁶ This is how BLAST computes its E-value (other heuristics may compute it in a different way).

mpam. The authors of the mpam method used the observation made in [Sellers, 1974] that global alignment with a metric scoring matrix is a metric function. They therefore revisited the mathematics of the PAM matrices, which resulted in the mpam (metric PAM) substitution matrix, whose sensitivity is similar to that of the PAM matrix [Xu and Miranker, 2004]. This matrix is then used in global alignment to define a measure for indexing q-grams with an MVP-tree (multi vantage point tree). That is very similar to BLAT, but unlike BLAT the mpam method employs global edit distance with a substitution matrix while aligning q-grams (BLAT uses exact or near-exact matching or Hamming distance⁸).

Metric Sequence Indexing & Search

One can see that almost all of the methods mentioned above use simpler distance functions than the weighted local edit distance; hence they are not well comparable with optimal alignment or with the methods in daily use (e.g. BLAST), and they definitely cannot give biologically optimal (correct) solutions. Our idea was to preserve the commonly used distance and turn it into a distance metric δ which satisfies the metric properties (reflexivity, non-negativity, symmetry and triangle inequality). Such a metric could then be utilized by various metric access methods (MAMs) [Zezula et al., 2005]. MAMs use the triangle inequality to organize database objects into metric regions, and only those regions that have a nonempty intersection with the query region need to be examined.
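How the triangle inequality lets a MAM skip whole regions can be shown with a small sketch. It assumes ball-shaped regions given as hypothetical triples (center, covering radius, members) and a true metric `dist`; the names `range_query` and `abs_diff` are illustrative only, and the flat list of regions stands in for a real index structure.

```python
def abs_diff(x, y):
    """Toy metric on numbers (assumption), standing in for a sequence distance."""
    return abs(x - y)


def range_query(regions, query, radius, dist=abs_diff):
    """Return objects within `radius` of `query` and the number of distance computations.

    regions: list of (center, covering_radius, members), where every member
    lies within covering_radius of center.
    """
    hits, computations = [], 0
    for center, covering_radius, members in regions:
        computations += 1
        # Triangle inequality: if d(query, center) > covering_radius + radius,
        # the region ball and the query ball do not intersect, so no member
        # can be closer to the query than radius.
        if dist(query, center) > covering_radius + radius:
            continue
        for obj in members:
            computations += 1
            if dist(query, obj) <= radius:
                hits.append(obj)
    return hits, computations
```

The pruning step is only correct when `dist` really satisfies the triangle inequality; with a non-metric measure a skipped region may still contain relevant objects, which is precisely the reason the measure has to be turned into a metric first.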
Creating the Metric

To be able to compare our method with practical solutions (e.g. BLAST) we need to use the E-value as the distance measure. Since the E-value is not a metric, we need to turn it into one. The original E-value satisfies just the non-negativity property, but it can be easily modified to satisfy reflexivity and symmetry as well. To enforce reflexivity, we let identical sequences have zero E-value (the probability that a non-identical alignment would have a lower E-value is in practice really low). To satisfy symmetry, a more substantial change is needed. The problem is caused by the query length used in computing the E-value, but because on average the length of the query is similar to the average DB sequence length (the query object is usually a protein sequence), we replace n in the formula by max(m, n).

TriGen Algorithm. Now the hardest part remains, and that is to fulfill the triangle inequality. We use the TriGen algorithm [Skopal, 2006], designed for turning semi-metrics into metrics (or approximations of metrics) by applying concave similarity-preserving modifiers (functions). Such a modifier is applied to a training set of triplets and makes them preserve the triangle inequality. Moreover, one can specify the proportion of triplets from the training set that may violate the triangle inequality (the so-called T-error tolerance), thus allowing faster but approximate searching.

LAESA. The LAESA method [Micó et al., 1994] is a pivot-based MAM which uses m pivots (objects from the database) for mapping each object into an m-dimensional vector; hence the database is an n × m matrix. When querying, the query is mapped into the pivot space and the matrix is sequentially scanned for (candidate) objects that overlap with the query. The candidates are then filtered in the original space. The LAESA method is very powerful in its pruning effectiveness; however, due to the expensive selection of pivots and the sequential processing of the distance matrix, its usage in dynamic database environments is limited.
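The pivot-based filtering of LAESA described above can be sketched as follows. This is a simplified illustration, not the original algorithm: pivot selection is skipped (the pivots are simply passed in), the names `build_pivot_table` and `laesa_range_query` are hypothetical, and a toy metric on numbers stands in for the modified E-value.

```python
def abs_diff(x, y):
    """Toy metric on numbers (assumption), standing in for a sequence distance."""
    return abs(x - y)


def build_pivot_table(db, pivots, dist=abs_diff):
    """Map every database object to its vector of distances to the pivots."""
    return [[dist(obj, p) for p in pivots] for obj in db]


def laesa_range_query(db, pivots, table, query, radius, dist=abs_diff):
    """Range query with pivot filtering; returns (hits, distance computations)."""
    # Map the query into the pivot space (one distance computation per pivot).
    q_vec = [dist(query, p) for p in pivots]
    computations = len(pivots)
    hits = []
    for obj, row in zip(db, table):
        # Triangle inequality gives |d(q, p) - d(o, p)| <= d(q, o) for every
        # pivot p, so the maximum over all pivots is a lower bound of d(q, o).
        lower_bound = max(abs(qp - op) for qp, op in zip(q_vec, row))
        if lower_bound > radius:
            continue  # filtered out without computing d(q, o)
        computations += 1
        if dist(query, obj) <= radius:  # verify the candidate in the original space
            hits.append(obj)
    return hits, computations
```

With the database [1, 5, 9, 20, 50], pivots [1, 50] and a query 8 with radius 2, only the object 9 survives the filtering and has to be verified, so three distance computations are needed instead of five. How well the filter works in general depends on the choice of pivots and on the intrinsic dimension of the data.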
M-tree & PM-tree. A typical tree-based MAM designed for database environments is the M-tree [Ciaccia et al., 1997]. The M-tree recursively bounds objects into balls specified by a center data object (one of the indexed ones).

⁷ A transformation and a distance based on wavelet functions are also introduced in that paper.
⁸ When aligning nucleotide sequences (for which BLAT is primarily intended), substitution matrices are not usually used, so it is all right not to use weighted edit distance.

The inner nodes of an M-tree index contain routing entries, each consisting of a region ball and a pointer to a subtree (all objects in a subtree must fall into the parent region ball). The leaf nodes contain ground entries, i.e. the DB objects themselves. When querying, only the nodes intersecting the query ball (whose radius is not constant in the case of kNN queries) are processed further.

The PM-tree, proposed in [Skopal, 2004; Skopal et al., 2005], is then a combination of both methods mentioned above, LAESA and the M-tree. Its entries also contain a set of pivots which prune the region ball; thus the total volume of a PM-tree data region is always smaller than that of an equivalent M-tree region. The number of pivots in the routing and ground entries may differ.

The data regions of the M-tree and PM-tree can overlap as a result of bad splitting. To fix this, the Slim-down algorithm [Skopal et al., 2003] was proposed to optimize an already built index. Slimming down is very expensive but it can significantly speed up querying (up to 10x).

Experimental Results

As the dataset we used a random subset of the Swiss-Prot database [Bairoch et al., 2004] of size 3000 with total number of amino acids. Another 100 random sequences were chosen as query sequences. All of the sequences were of maximal length 1000, which does not cause any problem when we realize that there are only 9191 longer sequences out of in the whole of Swiss-Prot, which makes 3% (the average sequence length is 365 in the whole of Swiss-Prot and 335 in the reduced variant). These longer sequences could be treated in a special way since they are just a small part of the whole.
We do not show a time comparison in our tests because we do not yet have an efficient implementation of Smith-Waterman, which is the crucial component of the running time. Our method could, however, be easily compared to SSEARCH (part of the FASTA package) when we realize that SSEARCH is equivalent to a sequential scan. To be able to compare index-based methods with BLAST, we distinguish the number of distance computations from computational costs. We define computational costs here as the number of comparisons of two letters. The computational costs of the tree-based methods are therefore estimated as the number of distance computations multiplied by the average size of the distance matrix for Smith-Waterman, which is = (for the derivation of the computational costs of BLAST see [Hoksza and Skopal, 2007]). To be able to compare index-based methods with sequential scan, we show the number of distance computations too.

Four indexing methods were tested: M-tree, PM-tree, slimmed PM-tree and LAESA, each of them using the same set of distance modifiers⁹ generated by the TriGen algorithm. As can be seen in Figure 2a, when zero error tolerance is used, the weight of the modifier causes the number of distance computations to be almost equivalent to a sequential scan. When the weight is too big, it makes the triangle inequality hold, but at the price of increasing the intrinsic dimension. However, the M-tree performs slightly better than sequential scan, which means that when searching the tree not all of the nodes were inspected (because of the inner nodes, the number of objects in the tree exceeds 3000). On the other hand, the PM-tree and slimmed PM-tree show worse results since there are additional distance computations to pivots (mapping the query). The most similar to sequential scan is the LAESA method, which uses a constant number of distance computations independently of the range of the query. The situation changes slightly when we allow a small error (Figure 2b). This causes the number of distance computations to decrease by about six percent compared to zero error tolerance.
Here, the PM-tree and slimmed PM-tree outperform the M-tree, and the difference is about 1%. On the other hand, LAESA showed just a slight improvement. But in both cases the BLAST method is evidently more efficient, since the efficiency of the index is almost that of a sequential scan even if a small error is allowed.

Why do the PM-tree and slimmed PM-tree behave better when some error is allowed? The answer to this question can be seen in Figure 2c, which shows, for a range query of five, the relation between the declared TriGen error tolerance and the real error experienced in the test. Here we can see that the real error of the PM-tree grows more quickly than the error of the M-tree and, moreover, that the slopes of those lines are almost inverse, which means that the PM-tree's gain in distance computations is counterbalanced by the error. From these two graphs it can also be seen that for a real error of 50% there still have to be done approximately 1200 distance computations, which is about computational operations. That means that if BLAST were such a bad heuristic that it had just a 50% success rate, it would still be noticeably more efficient.

⁹ Fractional power modifiers were used.

Figure 2. Relation between E-value and number of computations - range query (a, b) and relation between real error and distance computations (c). Panels (a) and (b) plot computational costs (millions) and distance computations at two error tolerances (the first is 0) for PM-tree(32,16) and SlimPM-tree(32,16), compared with BLAST and Sequential Scan respectively; panel (c) plots real error and distance computations against the TriGen error tolerance.

Conclusions and new promising approaches

In this paper, we have tested the suitability of metric access methods for indexing protein sequences. It has been shown that these methods are not applicable to the sequence alignment problem without modification. This is primarily because of the nature of the data to be indexed and of the distance function which is used to define similarity between them. This distance function is highly non-metric, which demands strong modifications to make it metric. This modification distorts distances in a way that strongly increases the intrinsic dimension of the data, and therefore the efficiency of search is almost the same as the efficiency of a sequential scan. Compared to sequential scan, however, it has the advantage that precision can be defined and thus traded off for efficiency.

These findings can direct further research into several areas. To name a few:

Examining TriGen modifiers and finding modifiers which would minimize the real error while distributing objects (i.e. sequences) in the space in a way that is appropriate for indexing methods (i.e. decreasing the intrinsic dimension).

Modifying the search structures. For example, examining the possibilities of cutting sequences into q-grams while being able to define the resulting error (caused by splitting and thus losing information included in the whole sequence) and (optimally) minimize it.

Modifying the computation of Smith-Waterman local alignment. The idea is to change the computation so that it is faster and the resulting scores do not violate the metric properties as much as they do now (for example by using borders to limit the computational space in the distance matrix).

Acknowledgments. This research has been supported by grant GAUK provided by the Grant Agency of Charles University.

References

Altschul, S.F., T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 25.
Bairoch, A., B. Boeckmann, S. Ferro, E. Gasteiger, Swiss-Prot: Juggling between evolution and stability, Brief. Bioinform., 5, 39-55.
Cameron, M., H.E. Williams, A. Cannane, Improved Gapped Alignment in BLAST, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(3).
Ciaccia, P., M. Patella, P. Zezula, M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, VLDB'97.
Dayhoff, M.O., R.M. Schwartz, B.C. Orcutt, A model for evolutionary change in proteins, Atlas of Protein Sequence and Structure, 5.
Henikoff, S., J.G. Henikoff, Amino acid substitution matrices from protein blocks, Proc. Natl. Acad. Sci., 89.
Hoksza, D., T. Skopal, Index-based approach to similarity search in protein and nucleotide databases, DATESO'07, 67-80.
Hsieh, T., H. Kuo, J. Huang, Filtering Bio-sequence Based on Sequence Descriptor, BioDM'06, 14-23.
Karlin, S., S.F. Altschul, Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes, Proc. Natl. Acad. Sci., 87.
Kent, W.J., BLAT - The BLAST-Like Alignment Tool, Genome Research, 12(4).
Mao, R., W. Xu, S. Ramakrishnan, G. Nuckolls, D.P. Miranker, On optimizing distance-based similarity search for biological databases, Proc IEEE Comput Syst Bioinform Conference.
Micó, M.L., J. Oncina, E. Vidal, A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements, Pattern Recognition Letters, 15, 9-17.
Needleman, S.B., C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, 48.
Ning, Z., A.J. Cox, J.C. Mullikin, SSAHA: A Fast Search Method for Large DNA Databases, Genome Research, 11(10).
Ozturk, O., H. Ferhatosmanoglu, Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases, Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering, 359-.
Pearson, W.R., D.J. Lipman, Improved Tools for Biological Sequence Analysis, Proc. Natl. Acad. Sci., 85.
Sellers, P.H., On the theory and computation of evolutionary distances, SIAM Journal on Applied Mathematics, 26(4).
Skopal, T., J. Pokorný, V. Snášel, Nearest Neighbours Search using the PM-tree, DASFAA'05.
Skopal, T., Pivoting M-tree: A Metric Access Method for Efficient Similarity Search, DATESO'04, 21-31.
Skopal, T., On Fast Non-metric Similarity Search by Metric Access Methods, EDBT'06.
Smith, T.F., M.S. Waterman, Identification of common molecular subsequences, Journal of Molecular Biology, 147.
Weimin, Ch., K. Aberer, Efficient querying on genomic databases by using metric space indexing techniques, DEXA'07, 148.
Xu, W., D.P. Miranker, A Metric Model of Amino Acid Substitution, Bioinformatics, 20(8).
Zezula, P., G. Amato, V. Dohnal, M. Batko, Similarity Search: The Metric Space Approach, Advances in Database Systems.
