An optimal algorithm for counting network motifs


Physica A 381 (2007) 482-490 www.elsevier.


Royi Itzhack, Yelena Mogilevski, Yoram Louzoun
Math Department, Bar Ilan University, Ramat-Gan, Israel
Received 7 January 2007; received in revised form 14 February 2007; available online 6 March 2007

Abstract

Network motifs are small connected sub-graphs occurring at significantly higher frequencies in a given graph compared with random graphs of similar degree distribution. Recently, network motifs have attracted attention as a tool to study the microscopic details of networks. The commonly used algorithm for counting small-scale motifs is the one developed by Milo et al. This algorithm is extremely costly in CPU time and in practice cannot handle large networks of more than 100,000 edges on current CPUs. We here present a new optimal algorithm, based on network decomposition, for counting k-size network motifs with constant memory costs and a CPU cost linear in the number of counted motifs. Our algorithm performs better than previous full enumeration algorithms in terms of running time. Moreover, it uses a constant amount of memory. It also outperforms sampling algorithms. Our algorithm permits the counting of three- and four-node motifs in large networks that consist of more than 500,000 nodes and 5,000,000 links. For large networks, it performs more than a thousand times faster than current algorithms.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Graph; Networks; Motif; Algorithm

1. Introduction

Milo et al. [1] defined motifs as basic interaction patterns recurring throughout different kinds of networks more often than in random networks with the same degree distribution. In biological networks, a small set of network motifs appears to serve as the building blocks of transcription networks from bacteria to mammals [2]. Specific network motifs are also found in signal transduction networks, neuronal networks and other biological and non-biological networks [3].
The analysis of network motifs also plays a role in network classification [3] and the analysis of structural network properties. A large amount of work was devoted to the interpretation and application of network motifs, but much less effort was devoted to the development of good motif counting algorithms.

[Corresponding author: Y. Louzoun, e-mail address: ylouzoun@gmail.com]

In general, we can divide the requirements for a k-motif counting algorithm in a given graph into three main elements [4]:

(1) Counting all k-size subgraphs occurring in the graph.
(2) Determining which of these subgraphs are isomorphic, and counting each isomorphic group only once.
(3) Comparing the motif number with the expected number in a random graph with the same connectivity structure.

Performing the first subtask (counting all connected k-size subgraphs) by explicitly enumerating all subgraphs of a certain size is extremely time consuming, because their number is potentially very large even in small, sparse networks. One approach proposed to overcome the high CPU cost is motif sampling, developed by Kashtan et al. [5] and by Wernicke [4]. Random sampling algorithms are efficient and can successfully approximate the expected number of network motifs using a small number of samples. Such algorithms collect samples from the whole network by randomly picking an edge adjacent to the current edge until a k-size subgraph is completed, and they process only a small number of sampled subgraphs. Sampling methods improve the running time dramatically. However, such methods can only estimate the frequency of subgraphs and cannot provide an exact enumeration.

In this paper, we refine the first and second tasks by optimizing the motif count to the level that every motif is counted once and only once, with practically no overhead. We here show the application of the algorithm to the measurement of three- and four-node subgraph occurrences. The time complexity of our algorithm is low enough to measure k-motifs directly on any graph of up to millions of edges. It actually performs as well as or better than sampling algorithms.

The latest version of the motif counting algorithm provided by Prof. Alon is denoted mfinder1.2 [6]. This algorithm starts the subgraph search by choosing a random edge and extending it iteratively from both ends until it obtains a k-size subgraph [5].
The number of subgraphs increases approximately as the number of edges to the power of k, while the runtime increases much faster than that for subgraphs with $k \ge 3$, especially for a large number of nodes.

We here present a new approach for the exact counting of network motifs that consumes a minimal running time. The approach is based on network decomposition [7] via node removal. We detect all motifs containing a given node by measuring all its incoming and outgoing neighbors up to degree $k-1$ (that is, within $k-1$ edges of the node), and then remove this node. We present our algorithm results for $k = 3$ and $4$ and describe the $k = 5$ algorithm. There is no point in developing motif counting algorithms for $k > 5$, since there are 1,530,843 different $k = 6$ subgraphs, making the results practically impossible to decipher and understand. However, if one is interested in the frequency of a specific k-motif for $k > 5$, the same algorithm can be extended to any k.

The rest of this paper is organized as follows. We first present the definitions and the motif counting algorithm. We then present some computational results on motifs extracted from random (Erdős-Rényi, ER) and scale-free networks to compare the efficacy of our algorithm with standard counting and sampling algorithms.

2. Results

The k-motif counting problem is defined as the task of enumerating all the connected subgraph patterns $G_k \subseteq G$ of size k. A k-motif is represented by a $k \times k$ connectivity matrix and all the matrices isomorphic to it. For example, the A-B-C motif can be represented by a connectivity matrix between A, B and C, where all values are zero except for the (A,B) and (B,C) cells. Replacing the rows and columns of A by B and of B by C would produce the same motif but a different matrix. Not all connectivity matrices are defined as motifs, since the subgraph has to be connected. Thus, in a k-size motif there must be an undirected walk between every pair of nodes.
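As a small illustration of these definitions, the following Python sketch stores the A-B-C chain as a connectivity matrix, relabels its nodes, and checks that a matrix describes a connected subgraph when edge direction is ignored. The helper names `relabel` and `is_weakly_connected` are assumptions made for this example and are not taken from the published code.

```python
from typing import List, Sequence

Matrix = List[List[int]]


def relabel(m: Matrix, perm: Sequence[int]) -> Matrix:
    """Return the connectivity matrix after renaming node i to perm[i]."""
    k = len(m)
    out = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            out[perm[i]][perm[j]] = m[i][j]
    return out


def is_weakly_connected(m: Matrix) -> bool:
    """True if all nodes can be reached from node 0 ignoring edge direction."""
    k = len(m)
    seen = {0}
    stack = [0]
    while stack:
        i = stack.pop()
        for j in range(k):
            if j not in seen and (m[i][j] or m[j][i]):
                seen.add(j)
                stack.append(j)
    return len(seen) == k


# The chain A -> B -> C with A = 0, B = 1, C = 2.
chain = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]

# Rename A to B, B to C and C to A: same motif, different matrix.
relabelled = relabel(chain, (1, 2, 0))

# A single edge plus an isolated node is not a motif.
disconnected = [[0, 1, 0],
                [0, 0, 0],
                [0, 0, 0]]

print(relabelled != chain, is_weakly_connected(relabelled))
print(is_weakly_connected(disconnected))
```

The relabelled matrix differs from the original one cell by cell, which is why a later step has to group matrices that describe the same motif.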
In a given directed graph $G(V,E)$, where V is the set of nodes and E is the set of edges, the time complexity of finding all the subgraphs $G_k \subseteq G$ has an upper bound of $O(E^k)$ [8] and a lower bound of $O(Vc^{k-1})$. We here present an optimal motif counting algorithm with an efficacy close to the lower bound, where each motif is counted precisely once, using a constant amount of memory (that does not depend on the network size). The algorithm is based on the decomposition of the original network through the systematic removal of nodes. We choose a random node. We count all motifs containing this node, using a memory structure that allows the direct enumeration of all motifs containing a given node. Once all the motifs containing a given node are counted, the node is removed from the network, and the process is repeated for the next node. The resulting time complexity for all size-k subgraphs, given an average total number of (incoming and outgoing) neighbors c, is less than $O(Vc^{k-1}\log^{k-2}(c))$, where $c \ll V < E$.

Fig. 1. Counting pattern of motifs of size 3, 4 and 5. Starting from left to right: the first count is for all the permutations of the second level. The next count is permutations of $(k-2)$ nodes from the first level and 1 from the second level, and so on.

Scale-free networks differ significantly from ER networks. In scale-free networks, the systematic removal of the highest degree nodes decomposes the network after a small number of steps [7], whereas in ER networks one must remove enough nodes to bring the network below the percolation level. Based on this decomposition principle, we build for each node a k-motif tree similar to a breadth-first search (BFS) tree. We count in this tree each motif containing the node exactly once, and then remove the node.

In a graph $G(V,E)$, the k-neighborhood $V_i^k$ of node $v_i$ is defined as the subset of nodes within a distance of at most $k-1$ edges from or to $v_i$. $V_i^k$ is spanned by a k-motif tree $T_i^k$. The first level of $T_i^k$ is $T_1 = v_{source}$, and $T_L$ denotes all the nodes in level L of $T_i^k$. Level $T_L$ consists of all the neighbors of each node $t_j \in T_{L-1}$ that are not contained in level $L-1$. The k-motif spanning tree resembles the regular BFS tree in its breadth span of the local neighborhood. It differs from it in that the breadth span ignores direction and that, in each level, nodes can appear more than once, depending on their ancestors (Fig. 2).
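The k-neighborhood defined above can be sketched as a direction-blind breadth-first expansion of depth $k-1$. The Python example below is a simplified illustration under stated assumptions: it only collects the distinct nodes of each level, whereas the k-motif tree of the paper lets a node appear several times in a level, once per ancestor. The function names `undirected_adjacency` and `neighborhood_levels` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def undirected_adjacency(edges: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Map each node to its incoming and outgoing neighbors."""
    adj: Dict[int, Set[int]] = {}
    for u, v in edges:
        if u != v:  # simple graphs only: ignore self-loops
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def neighborhood_levels(adj: Dict[int, Set[int]], source: int, k: int) -> List[Set[int]]:
    """levels[d] holds the nodes at undirected distance d from source, d <= k - 1."""
    levels = [{source}]
    seen = {source}
    for _ in range(k - 1):
        nxt: Set[int] = set()
        for u in levels[-1]:
            nxt |= adj.get(u, set()) - seen
        if not nxt:
            break
        seen |= nxt
        levels.append(nxt)
    return levels


# Directed edges; direction is ignored when spanning the neighborhood.
edges = [(1, 2), (3, 1), (2, 4), (5, 4), (4, 6)]
adj = undirected_adjacency(edges)
print(neighborhood_levels(adj, 1, 4))
```

Only nodes in these levels can take part in a k-motif that contains the source node, which is what keeps the per-node work local.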
All motifs containing the node $v_i$ are connected k-node subgraphs of $T_i^k$ and, conversely, all connected k-node subgraphs of $T_i^k$ are motifs. However, different subgraphs of $T_i^k$ can be the same motif. In order to count each such motif only once, we have developed a counting pattern. The counting pattern is based on a systematic increase in the analyzed depth. We first count all subgraphs of depth 1 (i.e. $v_i$ and $k-1$ of its direct neighbors). We then count all motifs with $k-2$ nodes of level 1 and one node of level 2, and so on. In Fig. 1, we show the application of this principle for $k = 3$ to $5$. For example, in the 4-motif count (from left to right in Fig. 1), we first count $v_{source}$ with three additional nodes from level two. We then count permutations of two nodes from level two and one of their sons from level three. The next count is all the possibilities of one node from level two and two of its sons from level three. The last pattern is one node from level two, its son from level three, and its grandson from level four. The exact recognition of the motif is also fast and is performed with a CPU cost of $O(1)$ and a constant memory usage, as will be further explained. Finally, after passing over the motif tree spanned by $v_{source}$, we remove $v_{source}$ and all edges connected to it from the network. The motif enumeration through the graph does not consume memory.

3. Four size motif algorithm count

We now describe in detail the $k = 4$ motif counting algorithm; the algorithm is similar for all other k. The four-motif count algorithm is based on four counting patterns. The first count pattern is $v_{source}$ and three of its sons (Fig. 1(4), left drawing). The second pattern is all the possibilities of $v_{source}$ with two of its sons, including $t_2$ from $T_2$, and one grandson whose father is $t_2$ (Fig. 1(4), middle left drawing). The third pattern is $v_{source}$ with its son $t_2$ and two of $t_2$'s sons (Fig. 1(4), middle right drawing).
The last pattern is $v_{source}$, its son $t_2$, $t_2$'s son $t_3$ and $t_3$'s son (Fig. 1(4), right drawing). Different counts can actually represent the same motif. For example, if $t_2$'s son $t_3$ is also its brother, we would count every motif containing $v_{source}$, $t_2$ and $t_3$ twice.
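The combination of node removal and exactly-once counting can be illustrated with a much simpler, less efficient scheme than the counting patterns. In the Python sketch below, duplicates are avoided by storing every node set found for the current node in a set of frozensets, which costs memory and therefore does not reproduce the constant-memory behavior claimed for the algorithm. All names (`build_adjacency`, `connected_sets_with`, `count_by_node_removal`) are illustrative assumptions.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def build_adjacency(edges: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Direction-blind adjacency: connectivity of a motif ignores direction."""
    adj: Dict[int, Set[int]] = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def connected_sets_with(adj: Dict[int, Set[int]], v: int, k: int) -> Set[FrozenSet[int]]:
    """All connected k-node sets containing v, each stored exactly once."""
    current: Set[FrozenSet[int]] = {frozenset([v])}
    for _ in range(k - 1):
        grown: Set[FrozenSet[int]] = set()
        for s in current:
            for u in s:
                for w in adj[u]:
                    if w not in s:
                        grown.add(s | {w})  # the set removes duplicates
        current = grown
    return current


def count_by_node_removal(edges: Iterable[Tuple[int, int]], k: int) -> int:
    """Count connected k-node subgraphs; every node is removed once processed."""
    adj = build_adjacency(edges)
    total = 0
    for v in sorted(adj):
        total += len(connected_sets_with(adj, v, k))
        for w in adj.pop(v):  # remove v and all edges attached to it
            adj[w].discard(v)
    return total


# A triangle 1-2-3 with a tail 3-4-5.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]
print(count_by_node_removal(edges, 3), count_by_node_removal(edges, 4))
```

Because a node is deleted as soon as its subgraphs are counted, a node set can only be found while its first-processed member is still in the network, so no set is counted under two different nodes.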

Fig. 2. Motif counting tree of a directed network. The leftmost diagram is the network itself. We span a tree from $v_{source} = 1$. The middle tree spans 3-motifs and the rightmost tree spans 4-motifs. Each level consists of either incoming or outgoing edges to one of the fathers at the level above. Nodes can appear more than once at a specific level with a specific root, but not at different levels of the same root.

In order to count every motif only once, we move from left to right in the tree and, before including a node in a count, we check that it does not appear as a brother of one of its ancestors (that is, as its own uncle). Note that for $k = 5$ we also avoid cousins, and so on. In order to minimize the cost of the search, we first order all nodes. The neighbors of every node are ordered when the BFS tree is created. The cost of searching for uncles is thus only $O(\log(c))$. The ordering of the graph is done only once, with a marginal cost of $O(E \log(c))$.

We exemplify the process on a simple tree (Fig. 2). We first insert the first ordered couple (1,2). We then insert the first ordered triplet (1,2,3) and only then count all the quadruplets in the first counting pattern, i.e. (1,2,3,4), (1,2,3,5) and (1,2,3,6). Once done with the first counting pattern based on the (1,2,3) triplet, we move to the second counting pattern based on it, and count (1,2,3,7), (1,2,3,9) and (1,2,3,10). For each such count, we check that the last node added to the quartet is not its own uncle or left cousin; the cost of each such check is either $\log(c)$ or $\log^2(c)$. We then continue with the next triplet of (1,2), i.e. (1,2,4), and proceed in the same way. We obviously do not take into account at this stage any node to the left of 4. After the (1,2,6) triplet is done, we move to the quadruplets containing (1,2,7): (1,2,7,9) and (1,2,7,10). Note again that we do not count (1,2,7,5), since 5 is its own uncle.
If 10 were also a son of 1, we would not count (1,2,7,10) either. We then continue in a similar way with all triplets containing (1,2). At the end of the (1,2) pair count, we move to the (1,3) pair count and proceed in the same way with all parts of the tree to the right of 3.

The time complexity of the algorithm is $O(c^3 \log^2(c))$ for a single node. When all motifs involving a node are counted, the node is removed from the network, with a cost of $O(c)$. Note that E (the number of edges), and with it c, decays as we decompose the network. The algorithm cost is thus bounded by $O(Vc^3 \log^2(c)) = O(Ec^2 \log^2(c))$, where the equality holds because E is proportional to $Vc$. The double check that we have to make on all the pattern counts except the first contributes the factor of $\log^2(c)$, but since c is usually very small, it does not affect the CPU cost in a significant way. The $\log^{k-2}(c)$ factor is the difference between the cost of our algorithm and the number of motifs. The code of the 4-motif count algorithm is shown in Fig. 3, where each color marks one of the counting patterns.

4. Motif recognition

The above-mentioned algorithm only checks whether a quadruplet represents a motif, but it does not check which motif it is. In order to relate a quadruplet to a 4-motif in a single operation, each quadruplet is represented by a $4 \times 4$ square matrix M. $M_{ij}$ is 1 if there is a direct edge from node i to node j; otherwise it is 0. Since we only treat simple graphs, $M_{ii}$ is set to 0. The matrix is then represented by a bitstring of $k(k-1)$ bits with $2^{k(k-1)}$ possible values. Each such bitstring represents a single motif, although multiple bitstrings can represent the same motif. There are 64, 4096 and 1,048,576 different possible bitstrings and 13, 199 and 9364 motifs for $k = 3$, $4$ and $5$, respectively. For $k = 6$, there are over $10^9$ possible bitstrings and over $10^6$ motifs, making the interpretation very cumbersome; we thus limit our code to $k = 5$ motifs.
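The bitstring representation can be sketched in a few lines of Python. The text does not specify the bit order, so the row-by-row order used here, with the first off-diagonal cell as the most significant bit, is an assumption, as are the function names `matrix_to_bits`, `bits_to_matrix` and `n_bitstrings`.

```python
from typing import List

Matrix = List[List[int]]


def matrix_to_bits(m: Matrix) -> int:
    """Pack the off-diagonal cells of a k x k 0/1 matrix, row by row, into an int."""
    k = len(m)
    bits = 0
    for i in range(k):
        for j in range(k):
            if i != j:  # the diagonal is always 0 in a simple graph
                bits = (bits << 1) | (1 if m[i][j] else 0)
    return bits


def bits_to_matrix(bits: int, k: int) -> Matrix:
    """Inverse of matrix_to_bits: rebuild the matrix from its bitstring."""
    m = [[0] * k for _ in range(k)]
    for i in reversed(range(k)):
        for j in reversed(range(k)):
            if i != j:
                m[i][j] = bits & 1
                bits >>= 1
    return m


def n_bitstrings(k: int) -> int:
    """Number of distinct bitstrings for k nodes: 2^(k(k-1))."""
    return 2 ** (k * (k - 1))


chain = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
code = matrix_to_bits(chain)
print(format(code, "06b"), [n_bitstrings(k) for k in (3, 4, 5)])
```

With this encoding a subgraph is mapped to an integer index in a number of steps that depends only on k, not on the network size.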
Once a set of nodes representing a motif is counted, we add one to a motif counting array at the cell indexed by the appropriate bitstring. The cost of this stage is $k^2$ operations.
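The counting array and the later merging of isomorphic bitstrings, which this section describes as a pass over all $k!$ row and column permutations, can be sketched as follows. This is a minimal Python illustration, not the published implementation: the bit order, the choice of the smallest bitstring as class representative, and the names `encode`, `decode`, `canonical`, `merge_counts` and `motif_classes` are all assumptions.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def cells(k: int) -> List[Edge]:
    """Off-diagonal cells in row-major order; cell 0 is the most significant bit."""
    return [(i, j) for i in range(k) for j in range(k) if i != j]


def encode(edges: Set[Edge], k: int) -> int:
    bits = 0
    for cell in cells(k):
        bits = (bits << 1) | int(cell in edges)
    return bits


def decode(bits: int, k: int) -> Set[Edge]:
    order = cells(k)
    n = len(order)
    return {order[p] for p in range(n) if (bits >> (n - 1 - p)) & 1}


def canonical(bits: int, k: int) -> int:
    """Smallest bitstring reachable by relabelling the k nodes."""
    edges = decode(bits, k)
    return min(encode({(p[i], p[j]) for i, j in edges}, k)
               for p in permutations(range(k)))


def weakly_connected(bits: int, k: int) -> bool:
    edges = decode(bits, k)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for i, j in edges:
            for a, b in ((i, j), (j, i)):
                if a == u and b not in seen:
                    seen.add(b)
                    stack.append(b)
    return len(seen) == k


def merge_counts(counts: List[int], k: int) -> Dict[int, int]:
    """Sum a per-bitstring counting array into one total per isomorphism class."""
    merged: Dict[int, int] = {}
    for bits, c in enumerate(counts):
        if c:
            rep = canonical(bits, k)
            merged[rep] = merged.get(rep, 0) + c
    return merged


def motif_classes(k: int) -> Set[int]:
    """Representatives of all connected motif classes on k nodes."""
    return {canonical(b, k) for b in range(2 ** (k * (k - 1)))
            if weakly_connected(b, k)}


counts = [0] * 64
counts[encode({(0, 1), (1, 2)}, 3)] += 2  # chain found twice with one labelling
counts[encode({(1, 2), (2, 0)}, 3)] += 3  # same chain, other labelling, 3 times
print(merge_counts(counts, 3), len(motif_classes(3)))
```

Running `motif_classes(3)` groups the 64 bitstrings into the 13 connected three-node motifs quoted in the text. The merging pass is independent of the network size, since it only touches the counting array.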


Fig. 3. Four-motif count algorithm. We perform four counts; the count patterns are marked according to the legend at the bottom. First we count the source node with all the permutations of three of its neighbors. Next we count the source node with two of its sons and one of its grandsons. The third pattern is the source node with one of its sons and two of its grandsons. The last count is all the possibilities of the source node with one of its sons, one of its grandsons and one of its great-grandsons. The function count motif is performed as described in the motif recognition section.

At the end of the entire process, we sum all the array cells whose bitstrings represent isomorphic motifs to get the full motif count. In order to find all the isomorphic patterns, we calculate for each motif pattern all the $k!$ permutations of rows and columns in the matrix. For each permuted matrix, we remove the diagonal and add the count of the original motif to the table cell indexed by the value of the permuted bitstring.

5. Running time comparisons

We have used two data sets for the comparison of the running time of the mfinder exhaustive algorithm and our algorithm. The first data set is Erdős-Rényi random networks ranging in size from 10 to 50,000 nodes, with varying average connectivities. We varied the connectivity while keeping the network size constant, and then varied the network size while keeping the connectivity constant. The second data set is scale-free networks with 50-50,000 nodes and a power of 2. The running times were compared for the 3- and 4-motif algorithms. The mfinder exhaustive search method is much slower than our algorithm for both 3- and 4-motifs. For example, the mfinder run time for a network composed of 50,000 nodes and 400,000 edges was 6 days, compared with 1.2 min for our algorithm (Figs. 4 and 5).
The run time ratio between our algorithm and mfinder increases from 10 to 500 as we increase the network size from 10 to 50,000 nodes (with a constant connectivity). Similar results are obtained for the ER networks when increasing the connectivity from 5 to 50 with a constant network size (1000 nodes) (Fig. 6) and for SF networks (Fig. 7).

Fig. 4. Three-motif counting running time comparison for the original network and randomized networks as a function of the network size (number of nodes) in ER networks. The dotted blue line represents our algorithm for 100 networks, the dashed red line represents the FANMOD sampling algorithm for 1000 networks and the full green line represents the mfinder1.2 algorithm for 100 networks. The networks had an average connectivity of eight neighbors per node. The inner figure shows the comparison in logarithmic scale. The growth rate of the running time is larger in the sampling algorithm than in our algorithm.

6. Comparison to sampling algorithms

The minimal CPU and memory cost of our algorithm allows us to extend the algorithm to higher values of k and to networks that were not considered tractable before. This algorithm, while providing a full motif enumeration, actually outperforms sampling algorithms. We have compared our results with the FANMOD algorithm, which samples 100,000 subgraphs of each network (which for large networks is less than 0.01% of the total number of subgraphs in the network). In order to find the motifs that are significantly over- or under-represented with our algorithm, we compare the frequency of each motif in the original network with its frequency in another 100 networks of similar degree distribution, as proposed by Kashtan et al. [5]. Note that FANMOD needs a larger comparison set due to the limitations of the sampling method. For all the above-mentioned networks, the running time of the FANMOD sampling algorithm was higher than that of our algorithm for both 3- and 4-motifs.
For example, for a network composed of 5000 nodes and 15,000 edges, the running time of FANMOD (for the original network and 1000 networks of similar degree distribution) was 6 h, compared with 12 min for our algorithm (with 100 networks of similar degree distribution). Even assuming that both algorithms require the same number of random networks, our algorithm would outperform FANMOD. Moreover, the running time ratio between the FANMOD algorithm and our algorithm increases (weakly) with network size and connectivity (Figs. 4-7). To summarize, not only is our analysis more accurate, it is also much faster than sampling methods.
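The comparison networks of similar degree distribution are commonly generated by repeated edge switching. The text does not state which randomization procedure was used, so the Python sketch below is an assumption: it applies the standard switch of two directed edges (a,b) and (c,d) into (a,d) and (c,b), which preserves every in-degree and out-degree but does not separately preserve the number of bidirectional edges. The function name `randomize_edges` and its parameters are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Edge = Tuple[int, int]


def randomize_edges(edges: List[Edge], n_swaps: int,
                    rng: Optional[random.Random] = None) -> List[Edge]:
    """Return a degree-preserving randomization of a simple directed edge list."""
    rng = rng or random.Random()
    out = list(edges)
    if len(out) < 2:
        return out
    present = set(out)
    done = 0
    attempts = 0
    max_attempts = 100 * n_swaps  # stop on graphs where few switches are legal
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(out)), 2)
        (a, b), (c, d) = out[i], out[j]
        new1, new2 = (a, d), (c, b)
        # Reject switches that create self-loops or duplicate edges.
        if a == d or c == b or new1 in present or new2 in present:
            continue
        present -= {out[i], out[j]}
        present |= {new1, new2}
        out[i], out[j] = new1, new2
        done += 1
    return out


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3), (4, 0), (2, 4)]
shuffled = randomize_edges(edges, 50, random.Random(0))
print(sorted(shuffled))
```

Counting motifs in a set of such randomized networks gives the reference frequencies against which the counts of the original network are compared.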

Fig. 5. Four-motif counting running time comparison for the original network and randomized networks as a function of the network size (number of nodes) in ER networks.

Fig. 6. Four-motif running time comparison for scale-free networks as a function of the network size (number of nodes). The dotted blue line represents our algorithm for 100 networks, the dashed red line represents the FANMOD sampling algorithm for 1000 networks and the full green line represents the mfinder1.2 algorithm for 100 networks. The inner figure shows the comparison in logarithmic scale. The results are similar to the ER results.

Fig. 7. Four-motif comparison for randomized networks as a function of the average connectivity. The networks consist of 1000 nodes and the number of edges grows from 5000 to 50,000. The dotted blue line represents our algorithm for 100 networks, the dashed red line represents the FANMOD sampling algorithm for 1000 networks and the full green line represents the mfinder1.2 algorithm for 100 networks. The inner figure shows the comparison in logarithmic scale.

7. Discussion

Motif counting is now a common practice in the analysis of networks [9-15]. This practice was limited up to now to small networks, since the standard algorithms are extremely slow and memory-costly for large networks. In order to approximate the number of motifs in large networks, sampling methods were developed [4,5]. We here present an optimal algorithm to count k-size motifs that is fast enough to allow motif counting in networks containing millions of edges in a reasonable time and with a constant amount of memory. The efficacy of our algorithm often makes it even more efficient than sampling methods, allowing a precise count where only approximations were possible before. The cost difference between our algorithm and the standard counting algorithm grows sharply with motif size, connectivity and node number. For example, in an ER network of 50,000 nodes and 400,000 edges, our algorithm is over 2000 times faster than the standard algorithm (5 days compared with 3.5 min). For larger networks, we could not compare the performances, since the standard algorithm cannot handle their motif count.

The main principle of our algorithm is the counting of all motifs passing through a node, followed by the removal of this node. The systematic removal of nodes eventually leads to the decomposition of the network. We further improve the efficiency of the algorithm by avoiding subgraph redundancy among the neighbors of a given node.
Network decomposition can be achieved in SF networks by the removal of a small number of high degree nodes (hubs). However, the cost of computing all motifs passing through these nodes is very large, and the order in which nodes are removed from the network has only a minor effect on the algorithm efficacy (less than a factor of 2). We have also optimized the network scrambling required to compare the network with random networks with similar one-directional and bidirectional edge distributions, but this did not affect the running time, since the main cost is the motif search and not the scrambling. Finally, our fast algorithm opens the way for the exact analysis of large networks, such as the WWW, without the need to depend on the estimates of sampling methods.

References

[1] R. Milo, S. Shen-Orr, S. Itzkovitz, et al., Science 298 (5594) (2002) 824.
[2] S.S. Shen-Orr, R. Milo, S. Mangan, et al., Nat. Genet. 31 (1) (2002) 64.
[3] R. Milo, S. Itzkovitz, N. Kashtan, et al., Science 303 (5663) (2004)
[4] S. Wernicke, F. Rasche, Bioinformatics 22 (9) (2006)
[5] N. Kashtan, S. Itzkovitz, R. Milo, U. Alon, Bioinformatics 20 (11) (2004)
[6] Mfinder1.2 software. /
[7] R. Albert, H. Jeong, A.L. Barabasi, Nature 406 (6794) (2000) 378.
[8] N. Alon, R. Yuster, U. Zwick, Algorithmica 17 (3) (1997) 209.
[9] M. Babu, N.M. Luscombe, L. Aravind, et al., Curr. Opin. Struct. Biol. 14 (3) (2004) 283.
[10] D. Li, J. Li, S. Ouyang, et al., Proteomics 6 (2) (2006) 456.
[11] Y. Louzoun, L. Muchnik, S. Solomon, Bioinformatics 22 (5) (2006) 581.
[12] H.W. Ma, B. Kumar, U. Ditges, et al., Nucleic Acids Res. 32 (22) (2004)
[13] R.J. Prill, P.A. Iglesias, A. Levchenko, PLoS Biol. 3 (11) (2005) e343.
[14] R.V. Sole, S. Valverde, Trends Ecol. Evol. 21 (8) (2006) 419.
[15] O. Sporns, J.D. Zwi, Neuroinformatics 2 (2) (2004) 145.
