Coupling graph perturbation theory with scalable parallel algorithms for large-scale enumeration of maximal cliques in biological graphs



N. F. Samatova 1,2,+, M. C. Schmidt 1,2,*, W. Hendrix 1,2,*, P. Breimyer 1,2, K. Thomas 3 and B.-H. Park 2

1 Computer Science Department, North Carolina State University, Raleigh, NC 27695, USA
2 Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
3 Cray, Inc., Seattle, WA 98104, USA

* Both authors contributed equally.
+ Corresponding author: samatovan@ornl.gov

Abstract. Data-driven construction of predictive models for biological systems faces challenges from data intensity, uncertainty, and computational complexity. Data-driven model inference is often considered a combinatorial graph problem where an enumeration of all feasible models is sought. The data-intensive and NP-hard nature of such problems, however, challenges existing methods to meet the required scale of data size and uncertainty, even on modern supercomputers. Maximal clique enumeration (MCE) in a graph derived from such biological data is often a rate-limiting step in detecting protein complexes in protein interaction data, finding clusters of co-expressed genes in microarray data, or identifying clusters of orthologous genes in protein sequence data. We report two key advances that address this challenge. We designed and implemented the first (to the best of our knowledge) parallel MCE algorithm that scales linearly on thousands of processors running MCE on real-world biological networks with thousands or hundreds of thousands of vertices. In addition, we proposed and developed the Graph Perturbation Theory (GPT), which establishes a foundation for efficiently solving the MCE problem in perturbed graphs, which model the uncertainty in the data. GPT formulates necessary and sufficient conditions for detecting the differences between the sets of maximal cliques in the original and perturbed graphs and reduces the enumeration time by more than 80% compared to complete recomputation.

1. Introduction

Many biological problems reduce to graph problems, with the maximal clique enumeration (MCE) problem being ubiquitous. The solutions of the MCE problem are used, for example, to align 3-dimensional protein structures [1], to integrate genome mapping data [2], to identify coexpressed genes [3], to identify common secondary structure elements of proteins [4], to detect protein-protein interaction complexes [5], to cluster similar mass spectrometry spectra [6], and to find clusters of orthologous genes [7]. The MCE problem is NP-hard [8], and its run time, in practice, scales exponentially with the problem size, unless P = NP [9]. The challenge remains of how to scale an MCE algorithm to graphs with hundreds, thousands, or even millions of vertices, which are not unusual for the problems considered in these examples.

© 2008 Ltd

The problem is exacerbated when, in addition to the size of the graphs, the uncertainty, or noise, in the data from which these graphs are derived is taken into account. In this case, multiple solutions of the MCE problem are often sought for various perturbed graphs. Perturbations may be induced by filtering out some edges due to applied edge weight cutoffs or by adding edges based on additional orthogonal information sources. For example, two genes in a gene expression network can be viewed as coexpressed (i.e., connected by an edge) if their Pearson correlation derived from microarray data is above a certain threshold; different thresholds correspond to different network perturbations. Likewise, two proteins in a protein-protein interaction network can be considered interacting if, in addition to genomic-context information (e.g., their neighborhood colocation on the genome), mass spectrometry pull-down experiments become available. The challenge is to support such what-if explorations of biological graphs that are not only large but also uncertain and changing.

To address both challenges, we achieved two major, interrelated advances. On the one hand, we developed a scalable, parallel MCE algorithm. It not only efficiently handles the data-intensive nature of the MCE problem but also offers linear speedup, even when using thousands of processors to run the algorithm on real-world biological graphs with thousands or hundreds of thousands of vertices. To the best of our knowledge, this is the first parallel MCE algorithm that scales to this number of processors on such large-scale, real-world problem sizes. On the other hand, we proposed and advanced a new theory, which we call Graph Perturbation Theory (GPT), that establishes a foundation for solving graph problems in perturbed graphs.

The intuition behind GPT is quite simple: if a solution for the reference, or unperturbed, graph is known, then it can be used to find the exact solution for the perturbed graph more efficiently than complete recomputation, especially when the perturbation is relatively small. Specifically, we formulated necessary and sufficient conditions for the maximal cliques that are induced or eliminated by the addition or removal of an arbitrary number of edges to a reference graph. Based on this theory, we produced a practical MCE implementation for perturbed graphs. It uses efficient indexing of features derived from the MCE solution for the unperturbed graph to detect the changes in the composition of maximal cliques induced by the target perturbations. We demonstrated more than 80% efficiency improvement compared to the traditional enumeration of maximal cliques in protein interaction networks for multiple organisms, even when the number of added edges (i.e., perturbations) ranged between 20% and 136%.

2. Parallel MCE algorithm

Our current parallel MCE algorithm is a parallelization of the widely used method of enumerating maximal cliques developed by Bron and Kerbosch (BK) [10]. Our previously developed pclique [11], the first parallel MCE algorithm, extends the algorithm of Kose et al. [12] (dubbed KOSE). In principle, KOSE is identical in spirit to the BK algorithm; it branches using alphanumeric ordering. However, whereas BK is a recursive algorithm with depth-first search (DFS) branching, KOSE is a serialized algorithm with breadth-first search (BFS) branching (see [13] for DFS and BFS definitions), which allows cliques of size k to be generated from cliques of size k − 1. Consequently, all maximal cliques are produced in lexicographic order, which is an invaluable asset in certain applications. Nevertheless, the BFS branching strategy inevitably makes KOSE memory-intensive.
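For readers unfamiliar with BK, the depth-first recursion it is based on can be sketched in a few lines of Python. This is a minimal illustration of the basic Bron-Kerbosch recursion without pivoting, not the authors' implementation; the adjacency-dictionary representation, the function name `bron_kerbosch`, and the toy graph are assumptions made for this example.

```python
def bron_kerbosch(graph, r=frozenset(), p=None, x=None):
    """Yield every maximal clique of `graph` (dict: vertex -> set of neighbors).

    r: current clique, p: candidate vertices, x: already-processed vertices.
    """
    if p is None:                 # top-level call: every vertex is a candidate
        p, x = set(graph), set()
    if not p and not x:           # nothing can extend r, so r is maximal
        yield r
        return
    for v in sorted(p):           # depth-first branching on each candidate
        yield from bron_kerbosch(graph, r | {v}, p & graph[v], x & graph[v])
        p = p - {v}               # v is done: move it from candidates ...
        x = x | {v}               # ... to the excluded set


# Toy graph (assumed for illustration): triangle 0-1-2 plus the edge 2-3.
toy_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
toy_cliques = set(bron_kerbosch(toy_graph))
```

Only the current branch of the search tree is held in memory at any time, which is the property that distinguishes DFS branching from the level-by-level storage required by a BFS scheme.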
Although pclique improves KOSE performance by using bitvector manipulation of common neighbors, the huge memory requirements remain unchanged. This limitation affects both the size of the graphs that can be handled by pclique and the speedup achieved by using more processors. As a result, on an SGI Altix 3700 machine, pclique achieved a speedup factor of just 91 on 256 processors for a graph of 2,895 vertices and 10,914 edges. This nonideal scaling motivated the development of a parallel DFS-based BK algorithm.

To enable the parallelization of BK, we proposed an effective decomposition of the BK search tree, in which leaf nodes represent maximal cliques, into subtasks of generating the child nodes of an interior node. To allow this decomposition, we introduced a candidate path data structure containing the minimal information required for a BK search subtree exploration [7, 14]. The difficulty in applying this decomposition lies in the fact that the nature of the BK search tree makes it impossible to determine a well-balanced distribution of subtasks a priori. In particular, the size and number of cliques in the subtree beneath a particular tree node are unknown until that subtree has been fully generated. Thus, certain unlucky computing elements may generate subtrees with more and larger maximal cliques than others. Unless these overloaded computing elements are allowed to transfer work to underloaded ones, the execution times of different computing elements may differ greatly (see figure 1).

Figure 1. Clarifying example of the impact of dynamic load balancing (DLB) on the execution times of the various processes. The black bars give the finishing times of the 16 processes used to run the parallel algorithm without dynamic load balancing. The white bars represent the finishing times with dynamic load balancing. The graphs were obtained by running the parallel algorithm on the Shewanella oneidensis gene expression graph.

The load balancing scheme we proposed couples a dynamic (runtime) work stealing process [15] with a stack splitting procedure [16] in order to minimize the idle (noncomputing) time over all processors. The amount of work for a computing element can be measured by the number of candidate path structures left in that computing element's stack. However, because the number of candidate paths remaining in a stack decreases gradually and increases rapidly over the course of the algorithm's execution, predicting exactly when a computing element will become idle is virtually impossible.
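The decomposition into independent subtasks can be made concrete with a small sketch. The text does not spell out the fields of the candidate path structure, so the `CandidatePath` tuple below (current clique, candidates, excluded vertices) is an assumption, as are the function names and the static round-robin split, which stands in for an a priori distribution of subtasks.

```python
from collections import namedtuple

# Hypothetical stand-in for a candidate path: enough to explore one subtree.
CandidatePath = namedtuple("CandidatePath", ["r", "p", "x"])


def expand(graph, path):
    """Process one search-tree node: return (maximal_clique_or_None, children)."""
    if not path.p and not path.x:
        return path.r, []                      # leaf node: r is a maximal clique
    children, p, x = [], set(path.p), set(path.x)
    for v in sorted(p):
        children.append(CandidatePath(path.r | {v}, p & graph[v], x & graph[v]))
        p.discard(v)
        x.add(v)
    return None, children


def run_worker(graph, stack):
    """Drain one worker's stack depth-first and return the cliques it found."""
    found = []
    while stack:                               # len(stack) measures remaining work
        clique, children = expand(graph, stack.pop())
        if clique is not None:
            found.append(clique)
        stack.extend(children)
    return found


def enumerate_split(graph, workers=2):
    """Split the root's children round-robin and explore each share separately."""
    root = CandidatePath(frozenset(), set(graph), set())
    _, subtasks = expand(graph, root)
    stacks = [subtasks[i::workers] for i in range(workers)]
    return [run_worker(graph, s) for s in stacks]


# Toy graph (assumed for illustration): triangle 0-1-2 plus the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
per_worker = enumerate_split(toy, workers=2)
```

Because each stack entry carries everything needed to explore its subtree, entries can be handed to any worker without further coordination. On this toy graph the static split already gives one worker both maximal cliques and the other none, a small instance of the imbalance that cannot be predicted before the subtrees are generated.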
To overcome this unpredictability, we implemented a receiver-initiated scheme that allows a computing element to become almost idle (stack size below some threshold) before requesting more work. When this threshold is reached, the nearly idle computing element requests more candidate paths from another, randomly chosen computing element. (This process, called random polling, is one of the most efficient methods of requesting work when the underlying architecture of the computing system is unknown [15].) If the randomly chosen computing element has work available, it sends the candidate path structures most likely to represent large subtrees, a procedure motivated by the concept of stack splitting [16]. If the responding computing element has no work, the requesting computing element selects another random computing element and repeats the request; the program terminates when all computing elements are idle. Figure 2 shows that the speedup of the algorithm is linear, and thus that the initialization and idle times are small relative to the total execution time.

Figure 2. Speedup of the parallel algorithm on the Saccharomyces cerevisiae protein interaction network with between 1 and 2,048 processes on a Cray XT4.

Parallel execution of the program is achieved by generating multiple processes, each capable of spawning multiple threads [17]. Interprocess communication is performed using MPI, and the threaded behavior of the application is enabled using POSIX threads (Pthreads). Each process is assumed to have its own memory, which its associated threads share. This hybrid parallelism is motivated by the fact that many modern high-performance machines consist of clusters of symmetric multiprocessing (SMP) units. By combining shared-memory and distributed-memory parallelism techniques, better performance is achieved. In addition, the implementation is portable across different computer architectures.

3. Graph perturbation theory and algorithms

The basic idea behind graph perturbation theory is to examine the differences (added or removed edges, in our case) between two graphs (an original graph, for which the maximal cliques have already been enumerated, and a perturbed graph) and to list only the maximal cliques that are introduced or destroyed by the perturbation. By leveraging the enumeration of the original graph, the maximal clique enumeration of the perturbed graph may be calculated more quickly. Intuitively, if the perturbation between the two graphs is relatively small, the two difference sets will be smaller than the full enumeration for the perturbed graph.

The following basic definitions are necessary before setting out our theory. Let $G$ and $G_{\mathrm{new}}$ denote the original and perturbed graphs, respectively, and let $\mathcal{C}$ and $\mathcal{C}_{\mathrm{new}}$ be the maximal clique enumeration of each. Define the difference sets $\mathcal{C}^{+} = \mathcal{C}_{\mathrm{new}} \setminus \mathcal{C}$ and $\mathcal{C}^{-} = \mathcal{C} \setminus \mathcal{C}_{\mathrm{new}}$. Theorems 3.1 and 3.2 establish simple necessary and sufficient conditions for containment in $\mathcal{C}^{+}$ and $\mathcal{C}^{-}$, respectively, when edges are added to $G$.

Theorem 3.1. $C \in \mathcal{C}^{+}$ if and only if $C$ is a maximal clique in $G_{\mathrm{new}}$ that contains some edge being added to $G$.

Proof. Clearly, if a maximal clique $C \in \mathcal{C}_{\mathrm{new}}$ contained some edge being added to $G$, $C$ would not be a clique in $G$, so $C \in \mathcal{C}^{+}$. Conversely, let $C \in \mathcal{C}_{\mathrm{new}}$ be such that $C$ contains no edge being added to $G$. As such, $C$ is a clique of $G$. Thus, either $C \in \mathcal{C}$, or there must be some $C' \in \mathcal{C}$ that strictly contains $C$. By our definition of $\mathcal{C}^{+}$, $C \notin \mathcal{C}^{+}$ if $C \in \mathcal{C}$. In the other case, since no edges are being removed from $G$, $C'$ would have to be a clique in $G_{\mathrm{new}}$, but this contradicts our assumption that $C$ is a maximal clique in $G_{\mathrm{new}}$.

Theorem 3.2. A maximal clique $C \in \mathcal{C}^{-}$ if and only if $C$ is a clique in $G$ and $C$ is a subset of some $C' \in \mathcal{C}^{+}$.

Proof. Let $C$ be an arbitrary maximal clique in $\mathcal{C}$. Since no edges are being removed from $G$ and $C$ is a clique in $G$, $C$ must be a clique in $G_{\mathrm{new}}$. Thus, $C \notin \mathcal{C}_{\mathrm{new}}$ if and only if $C$ is not maximal in $G_{\mathrm{new}}$, and $C$ is not maximal if and only if there exists some clique $C' \in \mathcal{C}_{\mathrm{new}}$ such that $C$ is a proper subset of $C'$. Such a $C'$ could not be a maximal clique of $G$, as this would contradict the maximality of $C$, so $C' \in \mathcal{C}^{+}$.

By Theorem 3.1, we know that all cliques of $\mathcal{C}^{+}$ are maximal cliques of $G_{\mathrm{new}}$ containing some edge being added to $G$. Thus, to calculate this set, we use a modified version of the BK algorithm.

On weighted protein-protein interaction networks for nine different organisms from [18], where the weight of each edge represents the probability that two proteins interact, we generated graphs by applying thresholds at probabilities 0.75 and 0.70. The perturbations introduced by lowering the threshold to 0.70 accounted for 20–40% of the edges in the networks for 6 of the 9 organisms. The networks for S. typhimurium and M. tuberculosis saw larger perturbations of 48% and 68%, respectively, and the E. coli network underwent a full 136% change in its number of edges. After calculating the maximal clique enumeration for the network of each organism using the cutoff 0.75, we calculated the maximal clique enumeration for the graph induced by the cutoff 0.70 using both the perturbed graph algorithm and a single-threaded version of the original BK implementation. The percentage improvement of the perturbational algorithm over BK appears in figure 3.

Figure 3. Percentage runtime improvement of the perturbational algorithm over BK for the induced perturbations.

As shown in the figure, all clique enumerations were produced by the perturbational algorithm in 50–85% less time than full recalculation by BK, even for the E. coli network, where more edges were added than existed in the original graph.
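The two theorems translate directly into an update procedure for the edge-addition case, sketched below in Python on a toy graph. This is a naive illustration under stated assumptions, not the authors' index-based implementation or their modified BK: here $\mathcal{C}^{+}$ is obtained by seeding a plain BK recursion with each added edge, and $\mathcal{C}^{-}$ by a direct subset test. All function names and the example graph are hypothetical.

```python
def bk(graph, r, p, x):
    """Basic Bron-Kerbosch recursion: yield maximal cliques extending r."""
    if not p and not x:
        yield frozenset(r)
        return
    for v in sorted(p):
        yield from bk(graph, r | {v}, p & graph[v], x & graph[v])
        p = p - {v}
        x = x | {v}


def all_maximal_cliques(graph):
    """Full enumeration, used here only as the reference result."""
    return set(bk(graph, frozenset(), set(graph), set()))


def add_edges(graph, added_edges):
    """Return a copy of `graph` with the given undirected edges added."""
    new_graph = {v: set(nbrs) for v, nbrs in graph.items()}
    for u, v in added_edges:
        new_graph[u].add(v)
        new_graph[v].add(u)
    return new_graph


def clique_deltas(old_cliques, new_graph, added_edges):
    """Compute (C_plus, C_minus) for a perturbation that only adds edges."""
    c_plus = set()
    for u, v in added_edges:
        # Theorem 3.1: new maximal cliques contain an added edge, so seed the
        # search with that edge and extend only through common neighbors.
        common = new_graph[u] & new_graph[v]
        c_plus.update(bk(new_graph, frozenset({u, v}), common, set()))
    # Theorem 3.2: an old maximal clique is destroyed exactly when it is
    # contained in some clique of C_plus.
    c_minus = {c for c in old_cliques if any(c < d for d in c_plus)}
    return c_plus, c_minus


# Toy example (assumed): triangle 0-1-2 plus edge 2-3, perturbed by adding 1-3.
g_old = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
added = [(1, 3)]
g_new = add_edges(g_old, added)
old_cliques = all_maximal_cliques(g_old)
c_plus, c_minus = clique_deltas(old_cliques, g_new, added)
updated = (old_cliques - c_minus) | c_plus
```

In this example the added edge creates the clique {1, 2, 3}, which absorbs the old clique {2, 3}, while {0, 1, 2} is carried over untouched. The search visits only the neighborhood of the added edge, which is the source of the savings when the perturbation is small relative to the graph.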
While these results favor the perturbational algorithm, it performed better on this very large perturbation than intuition would suggest. Better performance is typically observed under smaller (less than 20%) perturbations applied to the reference graph (results not reported).

4. Conclusion

We reported a novel capability for efficient enumeration of maximal cliques in biological graphs derived from large-scale, uncertain, and dynamically changing biological data. We demonstrated the first parallel MCE algorithm that scales linearly on thousands of processors for real-world biological networks with thousands or hundreds of thousands of vertices. We proposed the Graph Perturbation Theory (GPT), which takes advantage of the solution provided by parallel MCE on the reference graph to significantly reduce the time required to solve the MCE problem on the perturbed graphs. We developed a practical implementation of the perturbed MCE algorithm that uses efficient database indices, constructed using GPT, to achieve improved performance. The application of the MCE algorithms to real-world biological networks across multiple organisms has been demonstrated.

Acknowledgments

The authors are thankful to Cray Inc. for access to large-scale Cray XT systems and for insights into code optimization and benchmarks. This research has been supported by the Exploratory Data Intensive Computing for Complex Biological Systems project from the U.S. Department of Energy (Office of Advanced Scientific Computing Research, Office of Science). The work of NFS was also sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory. Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the U.S. D.O.E. under contract no. DEAC05-00OR

References

[1] Chen Y and Crippen G M 2005 Protein Science
[2] Harley E, Bonner A and Goodman N 2001 Bioinformatics
[3] Rokhlenko O, Wexler Y and Yakhini Z 2007 Bioinformatics 23 e184–e190
[4] Grindley H M, Artymiuk P J, Rice D W and Willett P 1993 Journal of Molecular Biology
[5] Zhang B, Park B H, Karpinets T and Samatova N F Bioinformatics (Oxford, England)
[6] Tabb D L, Thompson M R, Khalsa-Moyers G, VerBerkmoes N C and McDonald W H 2005 Journal of the American Society for Mass Spectrometry
[7] Park B H, Samatova N F, Karpinets T, Jallouk A, Molony S, Horton S and Arcangeli S 2007 SciDAC 2007 vol 78 (Boston, Massachusetts)
[8] Lawler E L, Lenstra J K and Kan A H G R 1980 SIAM Journal on Computing
[9] Garey M R and Johnson D S 1979 Computers and Intractability: A Guide to the Theory of NP-Completeness (WH Freeman & Co. New York, NY, USA)
[10] Bron C and Kerbosch J 1973 Communications of the ACM
[11] Zhang Y, Abu-Khzam F, Baldwin N, Chesler E, Langston M and Samatova N 2005 Supercomputing, Proceedings of the ACM/IEEE SC 2005 Conference p 12
[12] Kose F, Weckwerth W, Linke T and Fiehn O 2001 Bioinformatics
[13] Cormen T, Leiserson C E, Rivest R L and Stein C 2001 Introduction to Algorithms 2nd ed (McGraw-Hill)
[14] Park B H, Schmidt M, Thomas K, Karpinets T and Samatova N F 2008 Upcoming in Proceedings of IPDPS 2008
[15] Kumar V, Grama A Y and Vempaty N R 1994 Journal of Parallel and Distributed Computing
[16] Finkel R and Manber U 1987 ACM Trans. Program. Lang. Syst
[17] Thomas K, Samatova N F, Schmidt M and Park B H 2008 Upcoming in Proceedings of CUG 2008
[18] Flannick J, Novak A, Srinivasan B S, McAdams H H and Batzoglou S 2006 Genome Research
