Small Libraries of Protein Fragments through Clustering


Varun Ganapathi
Department of Computer Science, Stanford University
June 8, 2005

Abstract

When trying to extract information from protein structures, dimensionality reduction is important: as the dimensionality of a data source increases, data points that are actually similar to each other grow farther apart. We analyze the method of data reduction proposed by Kolodny et al. We approximate the three-dimensional conformation of a protein by concatenating the structures of fragments drawn from a library of common motifs, and we analyze the effects of the size of the library and the choice of clustering algorithm on the approximation error. Specifically, we extend the work of Kolodny et al. by applying both a spectral clustering algorithm and a hierarchical clustering algorithm to the task of library generation in order to evaluate their effect on approximation error. We find that the choice of clustering algorithm can have a significant impact.

1 Introduction

A recent trend in bioinformatics is the tremendous growth in the amount of data available. The Protein Data Bank, for instance, has been increasing in size at a very high rate. One important use of this information is exploring how protein sequence and structure affect function, as this enables medical progress and an understanding of how biological life works. The task of extracting useful information from the Protein Data Bank is difficult both computationally and conceptually because of the size and dimensionality of the data contained therein. One method of coping with large amounts of information is abstraction. There is therefore a need for algorithms that can pare data down to only its essential components, in other words, algorithms for dimensionality reduction. By reducing the dimensionality of biological data as a preprocessing step, we enable subsequent algorithms that attempt to extract information from the data to work more effectively. Of course, it is important not to throw out the baby with the bath water: any dimensionality reduction algorithm must conserve the relationships that are under study.

We propose a method of compressing the information contained in the three-dimensional coordinates of the atoms in a protein. We use unsupervised learning to discover a finite number of three-dimensional fragments that recur frequently in different protein structures. We express protein shape in terms of these motifs and analyze the approximation costs incurred. We attempt to exploit the following observations about proteins. It has been noted that protein shape may have more impact on protein function than protein sequence. Therefore, we ignore residue sequence and instead focus on the space of protein shape as described by the locations of the Cα atoms. We follow in the line of previous researchers who have approached the same task. In 1986, Jones and Thirup discovered that most regions of the protein backbone are composed of repeating motifs [1]. Unger et al. followed by finding motifs of up to ten residues in length. These observations are interesting because they indicate that, in the vast representational space of three-dimensional conformations of proteins with N residues, only a small portion is densely populated by actual protein conformations.
Here we build upon the recent work of Kolodny et al. [2], who described a method for creating libraries of protein fragments for use in approximating protein structure. That work explored the dependence of approximation error on library size and fragment size when using K-means with simulated annealing for library generation. In this paper, we explore the dependence of approximation error on library size and fragment size, as well as on the clustering algorithm. The goal is to discover whether the clustering algorithm affects the effectiveness of the library created, and if so, which clustering algorithms are best. In our experiments, we apply the spectral clustering algorithm of Ng, Jordan, and Weiss [3] as well as standard hierarchical clustering with single linkage.

2 Methodology

Given a set of proteins and the associated three-dimensional coordinates, we attempt to discover a set of structure fragments of fixed length. Our goal is to find a choice of structure fragments that achieves low approximation error when used to discretely approximate protein structure. Our approach for finding and evaluating the quality of such a set follows:

1. Divide each protein in the training set into sections of length F.
2. Represent each fragment as a vector of the three-dimensional coordinates of its F residues.
3. Calculate the distance, using some measure, between each pair of fragments.
4. Apply a clustering algorithm to divide the points into K clusters.
5. Choose the fragment that has the minimum distance to all other fragments in the same cluster as the centroid of that cluster.
6. For each section of the proteins in the test set, find the cluster centroid that best approximates its shape locally, ignoring the relative orientation of each section.
7. Calculate the approximation error.

3 Distance Metric

We follow Kolodny et al. and use the coordinate root-mean-square deviation (denoted crms) of the Cα atoms to measure the structural similarity of any two fragments. We chose this metric for several reasons. First, it satisfies the triangle inequality, which is important for its use in clustering. This distance metric can also distinguish between right-handed and left-handed structures. Moreover, since the structures are of fixed size, the problem of finding the optimal correspondence between the atoms is removed, making crms an extremely convenient choice without any major flaws. We applied Umeyama's algorithm to find the optimal least-squares superposition between each pair of structure fragments (a code sketch is given below); the crms distance is then calculated from the squared distances between corresponding Cα atoms. The actual similarity metric differs between hierarchical clustering and spectral clustering: in the latter, the affinity between two structure fragments s_i and s_j is calculated as exp(−d(s_i, s_j)²/(2σ²)), while in the hierarchical clustering method the distance is used directly to agglomerate the closest elements.

4 Clustering

There are several decisions to be made in our methodology: the number of clusters (the library size), the size of the fragments, and the clustering method are all variables. We consider the spectral clustering algorithm developed by Ng, Jordan, and Weiss; below we briefly describe the algorithm and refer the reader to their NIPS paper [3] for more information. We also briefly describe the standard hierarchical clustering algorithm that we used.
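To make the distance metric of Section 3 concrete, the following is a minimal sketch of the crms computation using the SVD-based superposition of Umeyama [4]. The paper's implementation was in Matlab; this NumPy version, including the `crms` function name, is an illustrative reconstruction rather than the authors' code. The determinant check keeps the transform a proper rotation, which is what lets crms distinguish left-handed from right-handed structures.

```python
import numpy as np

def crms(X, Y):
    """Coordinate root-mean-square deviation between two fragments of
    equal length after optimal rigid superposition (rotation plus
    translation, reflections disallowed).

    X, Y: (F, 3) arrays of C-alpha coordinates, row i matching row i.
    """
    # Center both fragments at their centroids.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Covariance matrix and its SVD (Umeyama / Kabsch).
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    # Force det(R) = +1 so mirror-image structures remain distinguishable.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # RMS deviation between the superposed coordinate sets.
    diff = Xc @ R.T - Yc
    return np.sqrt((diff ** 2).sum() / len(X))
```

Building the full pairwise distance matrix calls this once per fragment pair, which is the dominant cost noted in Section 6.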

4.1 Spectral Clustering

The spectral clustering algorithm proposed by Ng, Jordan, and Weiss combines spectral methods with K-means. The algorithm takes as input the matrix of pairwise distances between the elements of the set being clustered. An affinity matrix A is formed that contains the pairwise attraction, calculated as exp(−d(x, y)²/(2σ²)) for each pair of elements x, y ∈ S, where S is the set of structure fragments. One additional parameter, σ, is therefore required; it controls how rapidly the affinity between two elements falls off with distance. After normalizing the affinity matrix in a specific way, the K largest eigenvectors of the normalized matrix are calculated, where K is the number of clusters desired. The result is a matrix Y ∈ R^(n×K); the rows of this matrix are treated as points x_i ∈ R^K and clustered using K-means to find K clusters. A line search across different possible values of σ was performed, and the value minimizing approximation error was chosen. According to the theory and experiments described in the paper by Ng, Jordan, and Weiss, this spectral clustering algorithm works considerably better than K-means applied directly to the data.

4.2 K-Means Clustering

To implement the spectral clustering algorithm we used a standard K-means implementation. To avoid local minima we re-ran the algorithm five times with random starting points and chose the clustering with the least sum of squared distances to the centroids.

4.3 Hierarchical Clustering

In hierarchical clustering, pairs in close proximity to one another are grouped together. As objects are paired into binary clusters, the newly formed clusters are themselves grouped into larger clusters until only one cluster remains. The result is a hierarchical tree representing the data. After forming the tree, we cut it at the appropriate points to create libraries of a given size. A sketch of both procedures follows.
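As an illustration of Sections 4.1-4.3, here is a minimal sketch of both clustering procedures starting from a precomputed crms distance matrix. It uses scikit-learn's KMeans (its n_init restarts playing the role of the five random restarts of Section 4.2) and SciPy's single-linkage routines; the function names and library choices are assumptions, not the paper's Matlab code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

def spectral_labels(D, k, sigma, restarts=5):
    """Ng-Jordan-Weiss spectral clustering on an (n, n) distance matrix D."""
    # Affinity matrix with zero diagonal.
    A = np.exp(-D**2 / (2.0 * sigma**2))
    np.fill_diagonal(A, 0.0)
    # Symmetric normalization L = Dg^{-1/2} A Dg^{-1/2}.
    inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * np.outer(inv_sqrt, inv_sqrt)
    # Top-k eigenvectors (eigh returns ascending eigenvalues), with
    # each row renormalized to unit length.
    _, vecs = np.linalg.eigh(L)
    Y = vecs[:, -k:]
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    # K-means on the embedded rows, keeping the best of several restarts.
    return KMeans(n_clusters=k, n_init=restarts).fit_predict(Y)

def hierarchical_labels(D, k):
    """Single-linkage tree cut into k clusters (Section 4.3)."""
    tree = linkage(squareform(D, checks=False), method='single')
    return fcluster(tree, t=k, criterion='maxclust')
```

The line search over σ described in Section 4.1 corresponds to sweeping `sigma` over a grid and keeping the value that yields the lowest downstream approximation error.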
5 Experiments

The data set for this experiment consists of 90 proteins of various sizes chosen from the PDB. (This data set was provided by Guha Jayachandran, the Teaching Assistant for CS273; the exact details of its creation are not known.) The proteins were cut into fragments of sizes 5 and 7; these two sizes were chosen because in previous work they provided the best trade-off between complexity and approximation error. When divided into fragments of size 5 and 7, the total numbers of fragments produced were 3627 and 2575 respectively. Three library sizes, 100, 150, and 250, were used. Finally, we applied both hierarchical clustering and spectral clustering. The results are shown in the table and plots below. The average-error column contains the error divided by the number of fragments in the test set; this metric is a reasonable choice when comparing libraries of differing fragment length.

6 Implementation Details

In the course of this work, Matlab code was written for each of the clustering algorithms used, namely Ng, Jordan, and Weiss's spectral clustering, K-means, and hierarchical clustering. Moreover, Umeyama's SVD-based method for RMSD calculation [4] was also implemented in Matlab. Calculating the RMSD between each pair of the 2575 structure fragments of size 7 took approximately one hour on a 1.2 GHz Xeon machine, and finding 250 clusters using spectral clustering took approximately 25 minutes on the same machine. The computational time for the spectral clustering algorithm was about equally divided between the cost of finding the top eigenvectors and running K-means on the resulting points. The overall computational time was dominated by the cost of calculating the crms distance between all pairs of fragments; observe that this is O(N²F³), where N is the number of fragments and F is the size of each fragment.
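The remaining steps of Section 2, picking a representative per cluster (step 5) and scoring a test set (steps 6 and 7), reduce to a few lines once the distance matrices exist. This sketch assumes crms-based matrices computed as above; the text leaves open whether the reported per-fragment error is crms or squared crms, so the min/mean below is illustrative.

```python
import numpy as np

def medoids(D, labels):
    """Step 5: in each cluster, pick the fragment with minimum total
    distance to the other members as that cluster's representative."""
    reps = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        within = D[np.ix_(idx, idx)].sum(axis=1)
        reps.append(idx[within.argmin()])
    return np.array(reps)

def average_error(D_test_lib):
    """Steps 6-7: D_test_lib[i, j] holds the crms distance from test
    fragment i to library fragment j; each test fragment is matched to
    its closest library entry and the per-fragment errors are averaged."""
    return D_test_lib.min(axis=1).mean()
```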

Table 1: Results of experiments. The columns are the number of fragments, fragment size, clustering method (H = hierarchical, S = spectral), library size, approximation error for σ = 1, 2, and 3, and average error per fragment. (The numeric entries of this table were lost in extraction and are not reproduced here.)

7 Discussion

As expected, the approximation error decreased with the size of the library, except for one outlier: in the case of fragments of length 5 with spectral clustering, a library size of 100 mysteriously produced much higher approximation error than libraries of size 50 or 250. It seems likely that this was a result of K-means falling into a local minimum, which could have been prevented by increasing the number of K-means iterations or random restarts. When using spectral clustering, error decreased with longer fragments, which agrees with Kolodny et al.'s results; in contrast, with hierarchical clustering the error actually increased with longer fragments. Comparing the results of hierarchical clustering and spectral clustering, we see that the latter achieves uniformly and significantly lower error except for one anomalous point. This indicates that the choice of clustering algorithm has a significant impact on the quality of the clusters for the purpose of structure approximation.

8 Future Work

The work done here immediately suggests that a more thorough comparison of different clustering algorithms for the purpose of structure approximation would be interesting. While we have presented a method for expressing the three-dimensional structure of a protein in a more compressed format, the question remains: how can this lower-dimensional representation be used most effectively? Possibilities include clustering proteins after first approximating them in the manner we described; in a lower-dimensional space, relationships between proteins might become more evident and more easily extractable. It has also been observed (and is very reasonable) that residue sequences at the active sites of proteins are evolutionarily less likely to change, and the same should hold for structure fragments. It is possible, therefore, that evolutionary relationships might become more apparent when clustering using the structure-fragment sequences rather than the original three-dimensional structures. This is also more convenient, since algorithms for sequence alignment are quite efficient compared to algorithms for aligning the three-dimensional structures of proteins of different lengths. In a different direction, machine learning could be applied to predict the structure fragment locally as a function of sequence. If correlations between sequence and structure fragment can be extracted, then algorithms for predicting structure fragments from sequence could be used as features in general structure-prediction algorithms.

Figure 1: A plot of the average squared error per fragment versus library size for spectral clustering, using fragments of length 5 versus fragments of length 7. Observe that fragments of length 7 have lower error, which agrees with Kolodny et al.'s results. (The plot annotates the length-5 outlier as a possible K-means local minimum; only the caption is reproduced here.)

9 Acknowledgments

We thank Professor Serafim Batzoglou and Professor Jean-Claude Latombe of the Stanford University Computer Science department for their guidance throughout this work. We also thank Guha Jayachandran for his kind assistance.

References

[1] T. A. Jones and S. Thirup. Using known substructures in protein model building and crystallography. EMBO J., 5:819-822, 1986.

[2] R. Kolodny, P. Koehl, L. Guibas, and M. Levitt. Small libraries of protein fragments model native protein structures accurately. J. Mol. Biol., 323(2):297-307, 2002.

[3] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14, 2002.

[4] S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell., 13(4):376-380, 1991.

Figure 2: A plot of the average squared error per fragment versus library size for hierarchical clustering, using fragments of length 5 versus fragments of length 7. Observe that fragments of length 5 have lower error, which does not agree with Kolodny et al.'s results. (Only the caption is reproduced here.)
