3 Global Properties of Networks


Ralf Steuer (a) and Gorka Zamora-López (b)

(a) Humboldt University Berlin, Institute for Theoretical Biology, Invalidenstr. 43, Berlin, Germany.
(b) University of Potsdam, Institute for Physics, Nonlinear Dynamics Group, Am Neuen Palais 10, Potsdam, Germany.

3.1 INTRODUCTION

Complex dynamical systems are often characterized by a large number of nonlinearly interacting elements, giving rise to emergent properties that transcend the principle of linear superposition. In particular within the biological sciences, one of the primary challenges is to investigate how the collective behavior of cells, tissues or organisms can be understood in terms of the properties of their molecular constituents. To investigate this intricate connectivity of cellular systems, the analysis of complex networks has become an important part of molecular biology.

A large number of biological phenomena and processes can be translated into the abstract concept of a complex network, making biological problems mathematically tractable. Prominent examples include the representation of transcriptional regulation as a network, where vertices represent genes or proteins and edges represent regulatory interactions, as well as cellular metabolism, where vertices represent metabolites and edges represent biochemical interconversions. However, beyond these rather straightforward examples, more abstract processes can sometimes also be translated into the language of complex networks. For example, different configurational states of a protein may be represented as vertices, with edges indicating transitions between them.

Once a biological process or phenomenon is represented by a network, the tools of complex network theory allow for a systematic characterization of its structural properties. The analysis of network topology then seeks to uncover the functional organization, the underlying design principles and unknown organizing principles of cellular systems.
Indeed, as realized only rather recently, many empirically derived complex networks, ranging from technological and sociological to biological examples, share common topological features. The organizing principles of empirical networks often reflect crucial system properties, such as robustness, redundancy or other functional interdependencies between network elements. A quantitative analysis of the large-scale characteristics of complex networks thus contributes to a better understanding of the organization of cellular functions and has already made a significant impact on our current view of molecular biology.

Fig. 3.1 The substrate graph $G_S$ of the S. cerevisiae metabolic network [23], consisting of $N_V = 810$ vertices (metabolites) and $N_E = 3419$ edges. Directional information is omitted. Left: A visualization of the substrate graph using the freely available software package Pajek [9]. Right: A visualization of the adjacency matrix, with vertices (metabolites) ranked according to their degree. Each dot indicates that the corresponding vertices (metabolites) are connected by an edge. The figures are adapted from [61].

While not aiming at a comprehensive review, this chapter seeks to summarize and describe several basic measures and characteristics of network topology. The chapter is organized as follows: The main emphasis is placed on an overview of basic measures and indices that characterize the topology of networks, given within Section 3.2. In Section 3.3, several basic prototype models of complex networks are discussed. The subsequent Section 3.4 is devoted to a brief outline of global features of complex networks, such as hierarchies, modularity, attack tolerance and robustness. Finally, Section 3.5 provides notes on the statistical testing of network properties and describes several known pitfalls and possible misinterpretations in the statistical analysis of network properties.

The working example throughout this chapter is a reconstructed version of the S. cerevisiae metabolic network [23], consisting of 810 metabolites and 843 reactions. The original bipartite graph was collapsed, such that two metabolites are connected if they participate in a common reaction.
A graphical representation is shown in Fig. 3.1.

3.2 GLOBAL PROPERTIES OF COMPLEX NETWORKS

Following the nomenclature of Chapter 2, a network is formally represented by a graph $G = (V, E)$, consisting of a set $V$ of $N_V$ vertices and a set $E$ of $N_E$ edges. We distinguish between undirected graphs, whose vertices are connected by edges without any directional information, and directed graphs (digraphs), whose edges possess directional information. Additionally, in weighted graphs, each edge (directed or undirected) is associated with a scalar value, quantifying a possible interaction strength, a cost, or a flow on the respective edge.

Fig. 3.2 Representations of complex networks. a) A directed network, consisting of $N_V = 7$ vertices and $N_E = 13$ directed edges. b) The adjacency matrix $A$ of the network. c) The set of adjacency lists, specifying to which other vertices each vertex connects. d) The distance matrix $D$ with elements $d_{ij}$. Note that the distances are not symmetric and may be infinite, indicating that not all vertices can be reached from all other vertices. e) The input degree $k_i^{\mathrm{in}}$ and output degree $k_i^{\mathrm{out}}$ of each vertex.

In most cases, a network is represented by its adjacency matrix $A$, with entries $A_{ij} = 1$ indicating that there exists an edge between vertices $n_i$ and $n_j$, and $A_{ij} = 0$ otherwise. For undirected networks, the adjacency matrix is symmetric, $A_{ij} = A_{ji}$. For weighted networks, the elements of the adjacency matrix are replaced by nonbinary scalar values. However, in particular for sparse networks, i.e., networks where the number of edges is much smaller than the number of possible edges, $N_E \ll N_V^2$, the adjacency matrix becomes inefficient in terms of memory allocation. Alternatively, the network can be specified by a set of adjacency lists, consisting of $N_V$ lists that enumerate to which other vertices each vertex connects, see also Chapter 2. The adjacency matrix and the adjacency lists each have their own advantages and disadvantages in terms of computational efficiency. A schematic example of both representations is given in Fig. 3.2.
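The two representations can be converted into each other in a few lines. The following is a minimal sketch in Python; the function names and the 4-vertex directed example graph are hypothetical and are not taken from Fig. 3.2. It also shows how the input and output degrees follow from column and row sums of $A$.

```python
def adjacency_lists(matrix):
    """Convert a 0/1 adjacency matrix (list of rows) into adjacency lists."""
    return [[j for j, a in enumerate(row) if a] for row in matrix]


def adjacency_matrix(lists):
    """Convert adjacency lists back into a dense 0/1 adjacency matrix."""
    n = len(lists)
    matrix = [[0] * n for _ in range(n)]
    for i, neighbors in enumerate(lists):
        for j in neighbors:
            matrix[i][j] = 1
    return matrix


# Hypothetical directed example: A[i][j] = 1 means an edge from n_i to n_j.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]

lists = adjacency_lists(A)
out_degree = [len(neighbors) for neighbors in lists]   # row sums of A
in_degree = [sum(column) for column in zip(*A)]        # column sums of A
```

The matrix needs $N_V^2$ entries regardless of the number of edges, whereas the lists store only one entry per edge, which is the memory argument made above for sparse networks.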

Distance, Average Pathlength and Diameter

In a network consisting of $N_V$ vertices, the distance $d_{ij}$ between any two vertices $n_i$ and $n_j$ is given by the length of the shortest path between the vertices, i.e., the minimal number of edges that need to be traversed to travel from vertex $n_i$ to $n_j$. The shortest path between two vertices does not have to be unique; often there exist several alternative paths with identical pathlength. For directed networks, the distance from vertex $n_i$ to $n_j$ is usually not symmetric, $d_{ij} \neq d_{ji}$. Likewise, for directed as well as disconnected networks, i.e., networks consisting of two or more isolated components, there might not always be a path that connects vertex $n_i$ to $n_j$. In such a case, the distance between the respective vertices is infinite, $d_{ij} = \infty$. See Fig. 3.2 for examples.

The diameter $d_m = \max(d_{ij})$ of a network is defined as the maximal distance of any pair of vertices. The average or characteristic pathlength $d = \langle d_{ij} \rangle$ of a network is defined as the average distance between all pairs of vertices. In the case of infinite distances, the average inverse pathlength $d_{\mathrm{eff}} = \langle 1/d_{ij} \rangle$, also referred to as efficiency, can be used to specify the average pathlength within the network. In this case, a fully connected network, $d_{ij} = 1 \; \forall \, i, j$, has an efficiency $d_{\mathrm{eff}} = 1$, whereas large distances and disconnected components (using the limit $1/d_{ij} = 0$ for $d_{ij} = \infty$) reduce the efficiency of the network.

The situation is slightly less straightforward if weighted networks are considered. Then, we are faced with the possibility to take additional information into account. For example, within a network of train connections, the shortest pathlength (distance) between two stations can be defined according to physical distances or, taking travel time into account, by the total time needed to travel from one station to another.
Furthermore, the fastest connection need not always be the cheapest, thus we might wish to define the distance between two stations according to the amount of money needed to travel from one station to another. In either case, the term distance between vertices can be generalized to accommodate additional scalar information, given by a weight factor that is associated with each edge.

Computationally, the calculation of the distance between two vertices is not trivial. Within the extensive literature on the shortest paths problem, the most common choices are the Dijkstra and the Floyd-Warshall algorithms [6]. The Dijkstra algorithm returns the lowest cost path between a source vertex $n_i$ and all other vertices in the network in $O(N_V^2)$ time. For efficiency reasons, the algorithm returns just one shortest path; enumerating all shortest paths between two vertices is computationally more involved and more expensive. To calculate the all-to-all distances, the Floyd-Warshall algorithm is the method of choice. The algorithm returns the distance matrix in $O(N_V^3)$ time. Both algorithms can straightforwardly incorporate weighted edges. Negative weights may induce cycles that reduce the cost of a path each time the cycle is traversed. In this case, the definition of the lowest cost path has to be modified.

Note that distances, pathlength and diameter also depend on network size and density (number of vertices and links) and are therefore not genuine classifiers that allow a straightforward comparison of different networks.
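As an illustration of the all-to-all computation, the following is a minimal Python sketch of the Floyd-Warshall recursion together with the diameter and the efficiency $d_{\mathrm{eff}}$. The function names and the 4-vertex directed example are hypothetical. Two assumptions are made that go beyond the text: all weights are non-negative, and the diameter is taken over finite distances only, since the plain maximum would be infinite for any network with unreachable pairs.

```python
import math

INF = math.inf


def floyd_warshall(weights):
    """All-to-all distance matrix, O(N^3) time.

    weights[i][j] is the cost of the edge from i to j, INF if no such edge
    exists, and 0 on the diagonal. Assumes non-negative weights.
    """
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


def diameter(dist):
    """Largest finite distance between two distinct vertices (assumption)."""
    return max(d for i, row in enumerate(dist)
               for j, d in enumerate(row) if i != j and d < INF)


def efficiency(dist):
    """Average inverse distance over all ordered pairs, with 1/INF = 0."""
    n = len(dist)
    total = sum(1.0 / dist[i][j]
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


# Hypothetical directed example: 0 -> 1 -> 2 -> 0 and 2 -> 3; vertex 3 is a sink.
W = [
    [0, 1, INF, INF],
    [INF, 0, 1, INF],
    [1, INF, 0, 1],
    [INF, INF, INF, 0],
]
D = floyd_warshall(W)
```

In this example no vertex can be reached from vertex 3, so the corresponding entries of `D` stay infinite and contribute zero to the efficiency, while the characteristic pathlength $\langle d_{ij} \rangle$ would diverge.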

Six Degrees of Separation: Concepts of a Small World

One of the striking properties of almost all empirical networks is that, despite their huge size of sometimes several millions of vertices, the average pathlength is usually surprisingly small. For example within cellular metabolism, represented by a network of metabolites (vertices) linked by biochemical reactions (edges), the average pathlength between two metabolites is only approximately $d \approx 3$, independent of the specific organism [22, 33, 71]. A recent study of the World-Wide Web (WWW), represented by a network of web documents (vertices) that are connected by directed hyperlinks (URLs), estimated that the average pathlength between any two vertices is only $d \approx 16$ [1], extrapolated for a network of 200 million documents.

The term small world network itself originated in the social sciences, reflecting the assertion that within networks of social acquaintances (or friendships) all people (vertices) on the planet are separated from each other by just a small number of intermediate friends or acquaintances ("six degrees of separation", although the specific value six must not be taken too literally). However, strictly speaking, the term small-world is not a genuine network property, i.e., there is no measure or statistical test that allows one to check whether a given empirical network belongs to the class of small world networks. As stated above, the average distance between vertices also depends on the size of the network: The more vertices a network has, the more distant the vertices tend to be. The small-world property is thus mainly understood to apply to network models whose average pathlength $d$ increases no faster than the logarithm of the network size, $d \lesssim \log N_V$ for $N_V \to \infty$.
A further distinction includes ultra-small networks [13], whose average pathlength scales as $d \sim \log \log N_V$.

The Degree Distribution

One of the most basic properties of a vertex $n_i$ is its degree $k_i$, defined as the number of edges adjacent to the vertex. In a network without self-loops (edges that connect a vertex to itself) and multiple links (two vertices connected by more than one edge) the degree equals the number of neighbors of the vertex. In the case of directed networks, we distinguish between the input degree $k_i^{\mathrm{in}}$ and the output degree $k_i^{\mathrm{out}}$. Taking all vertices of a network into account, we can ask for the probability $p(k)$ that the degree of a randomly chosen vertex equals $k$.

The degree distribution $p(k)$ has become one of the most prominent characteristics of network topology. One of the key discoveries that triggered the renewed interest in complex network theory was that the distribution $p(k)$ of many empirical networks approximately follows a power law $p(k) \sim k^{-\gamma}$, where $\gamma$ denotes the degree exponent. In contrast to the until then prevailing picture, where vertices are connected randomly and each vertex has approximately the same number of links, many empirical networks are strongly inhomogeneous: While the vast majority of vertices only possess a small number of links, a small number of vertices ("hubs") are highly connected. Examples of prototypical degree distributions are depicted in Fig. 3.3.

Fig. 3.3 Degree distributions of complex networks. a) A lattice-like network. Each vertex has the same degree $k$ (for periodic boundary conditions or large networks, such that vertices at the border can be neglected). b) An Erdős-Rényi random network. The degree distribution is homogeneous; the degrees of the vertices are centered around the average value. c) A scale-free network. The degree distribution is highly inhomogeneous and follows a power law of the form $p(k) \sim k^{-\gamma}$, where $\gamma$ denotes the degree exponent. While most vertices only have a low number of connections, a smaller number of vertices is highly connected.

Though being one of the most basic characteristics of network architecture, a statistically stringent numerical estimation of the degree distribution is far from trivial [25]. In the simplest case, $p(k)$ can be straightforwardly estimated from a (usually binned) histogram of degrees. However, for many real networks with strongly inhomogeneous degree distributions, the simple histogram approach provides insufficient statistics at high degree vertices and is a notorious source of misinterpretations [25]. More reliable in terms of numerical estimation is the cumulative degree distribution $p_c(k)$, defined as the probability that a randomly chosen vertex has a degree equal to or larger than $k$. The cumulative degree distribution $p_c(k)$ is a monotonically decreasing function of $k$ and its estimation requires no binning. For a power-law distribution $p(k) \sim k^{-\gamma}$, the cumulative degree distribution is of the form $p_c(k) \sim k^{-(\gamma - 1)}$. An exponential distribution $p(k) \sim \exp(-k)$ corresponds to an invariant cumulative distribution $p_c(k) \sim \exp(-k)$. Computationally even more straightforward is to rank the vertices according to their degree and plot the degree versus the rank of each vertex. Examples of different representations of the degree distribution are shown in Fig. 3.4. It should be noted that all empirical networks necessarily show deviations from a strict mathematical degree distribution.
In particular for power-law distributions, the size (number of vertices) of the network puts constraints on the estimation of the degree exponent. Highly connected vertices are rare, and their probability is thus difficult to estimate for small networks. Likewise, the number of vertices with small degree is restricted by network size. Consequently, the formula $p(k) \sim k^{-\gamma}$ often only applies to an intermediate region of the empirical degree distribution and has to be adjusted with an exponential cut-off at high degrees.

Fig. 3.4 Different representations of the degree distribution of the metabolite substrate network described in Fig. 3.1. a) A binned histogram. Shown is the number of vertices with a degree $k$, using a logarithmic binning. b) The cumulative degree distribution $p_c(k)$, i.e., the probability that a vertex has a degree larger than or equal to $k$. Note that the cumulative distribution does not require binning, but is obtained from the (normalized) number of vertices with degree larger than or equal to $k$. c) The rank plot of metabolites, ranked according to their degree $k$. A power law with exponent $\gamma_r$ in the rank plot corresponds to a degree exponent $\gamma \approx 2.3$ in the original degree distribution $p(k)$ and $\gamma_c \approx 1.3$ in the cumulative distribution. The straight lines are not fitted and only serve as a guide to the eye. [Axes: a) histogram versus degree $k$; b) cumulative distribution $p_c(k)$ versus degree $k$, annotated $\gamma_c = 1.29$; c) degree $k$ versus node rank.]

More importantly, as shown in the recent literature, the reported degree exponent of many empirical networks correlates with network size and thus might not reflect the actual exponent of the underlying networks [18, 17]. Furthermore, for small degree exponents the variance of the degree distribution is infinite, thus any empirical sample of vertex degrees is not a typical observation. However, for many biological problems it is often more important to note that the degree distribution is highly inhomogeneous and long-tailed than to ask whether the degree distribution fits a power law in a strict statistical sense.
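The binning-free estimators discussed in this section can be sketched in a few lines of Python. The function names and the small degree sequence below are hypothetical; the cumulative distribution uses the "larger than or equal to $k$" convention of the caption of Fig. 3.4.

```python
from collections import Counter


def cumulative_degree_distribution(degrees):
    """Return {k: fraction of vertices with degree >= k} for each observed k."""
    n = len(degrees)
    counts = Counter(degrees)
    p_c = {}
    remaining = n                  # number of vertices with degree >= current k
    for k in sorted(counts):
        p_c[k] = remaining / n
        remaining -= counts[k]
    return p_c


def rank_plot(degrees):
    """Return (rank, degree) pairs, rank 1 being the most connected vertex."""
    return list(enumerate(sorted(degrees, reverse=True), start=1))


# Hypothetical degree sequence of a small, inhomogeneous network.
degrees = [1, 1, 1, 1, 2, 2, 3, 5]
p_c = cumulative_degree_distribution(degrees)
ranks = rank_plot(degrees)
```

Plotting `p_c` or `ranks` on doubly logarithmic axes gives the representations of panels b) and c); neither requires a choice of bin width, which is the advantage over the histogram noted above.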
For weighted networks, the concept of degree can also be extended to account for the weights of the edges by defining the strength of a vertex as the sum of the absolute values of the weights of its adjacent edges.

Assortative Mixing and Degree Correlations

Despite its importance in the topological characterization of complex networks, the degree distribution itself provides only little information about the internal structure and organization of the network. It is thus more interesting to look for correlations between the degrees of adjacent vertices. A network is called disassortative if vertices with high degree connect preferentially to vertices with low degree. Vice versa, a network is called assortative if vertices with high degree preferentially connect to other vertices with high degree. As pointed out in the recent literature [49], social networks tend to be assortative, i.e., persons (vertices) with many friends (connections) tend to be connected to other persons with many friends, while most technological and biological networks are disassortative.

Formally, the degree correlation can be obtained from the joint probability distribution $p(k_i, k_j)$ that two connected vertices $n_i$ and $n_j$ have degree $k_i$ and $k_j$, respectively. For uncorrelated degrees the joint probability is given by the product of the marginal degree distributions, $p(k_i, k_j) = p(k_i) p(k_j)$. A measure for the deviation from statistical independence is given by the mutual information [64, 66]. Unfortunately, a direct numerical estimation of $p(k_i, k_j)$ is computationally demanding and often not feasible due to the limited size of the (empirical) network (but see also [66] for the numerical estimation of probability distributions and a discussion of finite size effects).

It is thus more straightforward to consider the Pearson correlation coefficient between the degrees of two adjacent vertices. The correlation coefficient or assortativity coefficient $r$ lies in the range $-1 \leq r \leq 1$, with $r < 0$ corresponding to a disassortative network and $r > 0$ to an assortative network. Note that the assortativity coefficient $r$, similar to the usual Pearson correlation, has its limits for strongly inhomogeneous degree distributions and fails to correctly quantify nonlinear degree correlations, e.g., in networks that are assortative for low degree vertices and disassortative for high degree vertices.

Another popular, and closely related, measure to evaluate degree correlations is the average neighbor degree [53]. For each vertex $n_i$ the average degree $k_{i,nn} = \frac{1}{k_i} \sum_{j=1}^{N_V} A_{ij} k_j$ of its neighbors is calculated. Subsequently, these values are averaged over all vertices having the same degree $k$, resulting in the average neighbor degree $k_{nn}(k)$. See Fig. 3.5 for examples of vertex degree correlations. Evaluating the degree correlations for weighted and directed networks requires slight modifications in the respective definitions.
In the case of directed networks, two distinct correlation indices are most interesting: (i) Do the in-degrees $k_i^{\mathrm{in}}$ of vertices correlate with their neighbors' out-degrees $k_i^{\mathrm{out}}$, and (ii) do the out-degrees $k_i^{\mathrm{out}}$ of vertices correlate with their neighbors' in-degrees $k_i^{\mathrm{in}}$? In the case of weighted networks, the degrees can again be replaced by their weighted counterparts.

The Clustering Coefficient

Another basic measure that accounts for the internal structure of a network is the clustering coefficient $C$. The clustering coefficient relates to the local cohesiveness of a network and measures the probability that two vertices with a common neighbor are connected. In the case of undirected networks, given a vertex $n_i$ with $k_i$ neighbors, there exist $E_{\max} = k_i (k_i - 1)/2$ possible edges between the neighbors. The clustering coefficient $C_i$ of the vertex $n_i$ is then given as the ratio of the actual number of edges $E_i$ between the neighbors to the maximal number $E_{\max}$,

$$C_i = \frac{2 E_i}{k_i (k_i - 1)}. \qquad (3.1)$$

See Fig. 3.6 for a schematic example. Note that, strictly speaking, the clustering coefficient $C_i$ is not a property of the vertex $n_i$ itself, but rather a property of its neighbors. The global or mean clustering coefficient $C = \langle C_i \rangle$ of the network is the average clustering coefficient of all vertices.
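Both the assortativity coefficient $r$ and the clustering coefficient of Eq. (3.1) are easily evaluated for an undirected network stored as adjacency sets. The following Python sketch is one possible reading of the definitions, not a reference implementation: the function names and the example graph are hypothetical, vertices of degree smaller than two are assigned $C_i = 0$ by convention (Eq. (3.1) is undefined for them), and $r$ is computed as the Pearson correlation over both orientations of every edge.

```python
def local_clustering(adj):
    """C_i = 2 E_i / (k_i (k_i - 1)) for each vertex of an undirected graph.

    adj maps each vertex to the set of its neighbors (no self-loops).
    Vertices with fewer than two neighbors get C_i = 0 (a convention).
    """
    coefficients = {}
    for vertex, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            coefficients[vertex] = 0.0
            continue
        # E_i: number of edges among the neighbors, each pair counted once
        links = sum(1 for a in neighbors for b in neighbors
                    if a < b and b in adj[a])
        coefficients[vertex] = 2.0 * links / (k * (k - 1))
    return coefficients


def assortativity(adj):
    """Pearson correlation of the degrees at the two ends of each edge."""
    pairs = [(len(adj[u]), len(adj[v])) for u in adj for v in adj[u]]
    m = len(pairs)
    mean = sum(x for x, _ in pairs) / m          # same for both ends by symmetry
    covariance = sum((x - mean) * (y - mean) for x, y in pairs) / m
    variance = sum((x - mean) ** 2 for x, _ in pairs) / m
    if variance == 0:
        raise ValueError("r is undefined if all edge-end degrees are equal")
    return covariance / variance


# Hypothetical example: a triangle 0-1-2 with a pendant vertex 3 attached to 0.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
C_i = local_clustering(graph)
C_mean = sum(C_i.values()) / len(C_i)
r = assortativity(graph)
```

In the example the best connected vertex has the lowest non-trivial clustering coefficient ($C_0 = 1/3$), and the network is disassortative ($r < 0$), because the hub connects to vertices of lower degree.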

Fig. 3.5 Vertex degree correlation in the substrate graph. Left: The average neighbor degree $k_{i,nn}$ of each vertex $n_i$, plotted versus the degree $k_i$. The solid line gives the (binned) average over all vertices with the same degree $k$. For large degrees a weak negative correlation is observed. Right: The clustering coefficient $C_i$ of each vertex versus the degree $k_i$. Highly connected vertices exhibit a low clustering coefficient, i.e., highly connected vertices preferentially connect to vertices that are not mutually connected, indicating a hierarchical structure.

Many empirical networks exhibit a rather high clustering coefficient, indicating a local cohesiveness and a tendency of vertices to form clusters or groups. Indeed, for example in social networks, it seems intuitive that two persons (vertices) who have a common friend are much more likely to be friends themselves than two randomly chosen persons. Interestingly, this also directly relates to the notion of degree correlations and dynamics on networks. As persons that share a common friend are likely to become acquainted themselves, they will acquire new friends over time. In particular, a highly connected person will induce new connections among his friends (neighboring vertices). In this sense, within social networks, a situation with disassortative degree correlations and low clustering coefficients is dynamically unstable and must be expected to evolve gradually towards more clustering and thus assortative degree correlations.

However, despite its conceptual simplicity, the interpretation and statistical testing of the clustering coefficient holds some pitfalls, which are discussed in more detail in Section 3.5. Furthermore, the clustering coefficient depends on the number of edges within the network.
To claim a nontrivial local clustering within the network, an estimated value of $C$ thus has to be compared to an appropriate null model to validate whether the value is indeed statistically significant, i.e., whether the respective network indeed exhibits a higher degree of clustering than a corresponding random network. Difficulties also arise for specific types of graphs, such as bipartite graphs, that exhibit a nontrivial clustering coefficient inherent to the bipartite structure [1, 52], see Section 3.5 for a detailed discussion.

Fig. 3.6 The clustering coefficient relates to the local cohesiveness of a network. a) The clustering coefficient is defined as the probability that two vertices with a common neighbor are connected. b) A highly connected vertex with a low clustering coefficient, indicating an (at least locally) hierarchical structure. c) A vertex with high clustering coefficient $C_{\mathrm{vertex}} = 0.8$.

An alternative, closely related definition of $C$ can be given with respect to the number of triads (triples of vertices where each vertex is connected to both others) within a network. Note that the number of edges between the neighbors of a vertex is equal to the number of triads that vertex is part of. The global clustering coefficient is then defined as the proportion of triads in a network with respect to the total number of connected triples (triples where at least one vertex is connected to both others),

$$C = \frac{3 \times \text{number of triads}}{\text{number of connected triples}}. \qquad (3.2)$$

The factor 3 accounts for the fact that each triad contributes to 3 connected triples [1]. A characterization of the clustering coefficient with respect to the number of triads holds some advantages with respect to numerical estimation and can be generalized to other structures, such as the number of squares [31].

Of particular interest is also the correlation of the clustering coefficient $C_i$ with other properties of a vertex $n_i$. For example, as described by Newman [49], many empirical networks exhibit a negative correlation between the degrees $k_i$ and the clustering coefficients $C_i$, indicating a modular structure of the network. See Fig. 3.5 for an example.

The Matching Index

Within many empirical networks, two vertices that are functionally similar do not necessarily have to be connected. For example, within a network of protein interactions, two proteins that are involved in the regulation of similar processes, and should therefore be considered as closely related, need not necessarily bind to each other. Correspondingly, the normalized matching index $M_{ij}$ quantifies the similarity between two vertices $n_i$ and $n_j$ based on the number of common neighbors they share.
$$M_{ij} = \frac{\text{common neighbors}}{\text{total number of neighbors}} = \frac{\sum_{k=1}^{N_V} A_{ik} A_{jk}}{k_i + k_j - \sum_{k=1}^{N_V} A_{ik} A_{jk}} \qquad (3.3)$$

Note that for the measure to be properly normalized, the denominator only counts the number of distinct neighbors, i.e., neighbors that are shared by both vertices are only counted once. One of the virtues of the matching index is that it can be straightforwardly applied to networks consisting of different types of vertices, such as bipartite graphs. For example, two transcription factors may regulate the expression of similar genes, without necessarily regulating (or binding to) each other. A schematic illustration of the matching index is given in Fig. 3.7.

Fig. 3.7 Vertices that are functionally related do not necessarily have to be connected. The matching index counts the number of common neighbors shared by two vertices, normalized by the total number of distinct neighbors. The right panel shows the adjacency list of the vertices $n_1$ and $n_2$, along with the corresponding matching index $M_{12}$.

The matching index can be generalized beyond the immediate neighbors of a vertex or extended to multiple vertices [40]. Furthermore, at the most general level, two vertices can be regarded (or defined) as similar if their distance to all other vertices within the network is approximately the same, irrespective of whether they are directly connected or not [75]. An advantage of this definition lies in the fact that the actual pair-wise similarity of two vertices need not be specified. The definition only draws upon the notion that two entities (vertices) must be considered similar if they perceive the rest of the world (here the distance to all other vertices within the network) in a similar way.

Network Centralities

Closely related to distance measures, network centrality indices seek to characterize each vertex or edge with respect to its position within the network. Centrality measures will be discussed in more detail in Chapter 4 of this book; here we will only briefly outline some basic features. Intuitively, a basic measure of the importance of a vertex $n_i$ is its degree $k_i$ (degree centrality). And indeed, several studies on biological networks report a significant relationship between vertex degree and functional importance of vertices [2].
For example, within protein interaction networks, the removal of highly connected proteins is more likely to have lethal effects than the removal of proteins with only a small number of links [32]. However, the degree is clearly not the only determinant of the functional importance of a vertex. Often more relevant is the contextual location of the vertex within the network. For example, we can ask from which vertex a signal should be sent to reach all other vertices in minimal time. Or, vice versa, which vertices can be reached fastest from any other vertex within the network? In this respect, the closeness centrality specifies which vertices have the shortest paths to all others, measured for example by the (inverse of the) average distance from a vertex to all other vertices. For detailed definitions see Chapter 4 in this book.

Fig. 3.8 The degree of a vertex does not necessarily reflect its importance with respect to the function of a network. While vertex $n_1$ has a high degree, its removal does not necessarily affect communication within the network. However, removal of vertices with low degree may have significant effects on communication or mass flow within the network, as seen for vertex $n_2$.

Probably the most well-known centrality measure is the betweenness centrality (BC). The betweenness centrality can be defined with respect to vertices and edges, and measures how often a vertex or edge is present in the set of all shortest paths. As can be seen in Fig. 3.8, low degree vertices can be crucial to establish communication or mass flow within a network. Thus, with respect to robustness properties of a network, a selective attack on vertices with high BC was often found to be more relevant than a removal of vertices with high degree. Computationally, the estimation of the betweenness centrality is rather demanding and is described in Chapter 4 of this book.

Eigenvalues and Spectral Properties of Networks

An important aspect of network topology is the spectrum of the adjacency matrix $A$. Though as yet rarely used in biological research, the spectra of random graphs are among the oldest characteristics of network topology, with a plethora of applications in many branches of physics [1]. For an undirected graph, the symmetric adjacency matrix $A$ has $N_V$ real eigenvalues $\lambda_i$. The spectral density $\rho(\lambda)$,

$$\rho(\lambda) = \frac{1}{N_V} \sum_{i=1}^{N_V} \delta(\lambda - \lambda_i), \qquad (3.4)$$

approaches a continuous function for increasing network size $N_V$. An extensive amount of work about the mathematical properties of the spectral density is available, including the famous Wigner semicircle law [21, 1].
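As a small worked illustration of Eq. (3.4), consider a hypothetical undirected network that is not part of the chapter's examples: a chain of three vertices, $n_1 - n_2 - n_3$. Its eigenvalues follow from the characteristic polynomial, and for so small a network the spectral density is a sum of three isolated peaks rather than a continuous function.

```latex
A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
\det(A - \lambda I) = -\lambda^3 + 2\lambda = -\lambda\,(\lambda^2 - 2) = 0
\;\Rightarrow\;
\lambda_{1,2,3} = -\sqrt{2},\; 0,\; \sqrt{2},

\rho(\lambda) = \tfrac{1}{3}\left[\,\delta(\lambda + \sqrt{2}) + \delta(\lambda) + \delta(\lambda - \sqrt{2})\,\right].
```

The eigenvalues are real, as expected for a symmetric matrix, and sum to zero, since the trace of an adjacency matrix without self-loops vanishes.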
Of more relevance to the biological sciences, the eigenvalues of network matrices are becoming increasingly important with respect to two different fields of research: First, in networks of coupled oscillators, i.e., in networks where each vertex corresponds to an oscillator coupled to other oscillators via an adjacency matrix, the global dynamics of the system are determined by the structure of the adjacency matrix. In particular, the stability of the synchronized state, i.e., the state of the network where almost all vertices oscillate synchronously, can be related to the eigenvalues of the Laplacian matrix of the network [54], defined in close analogy to the adjacency matrix. Recent studies also take into account the effect of weighted edges [74].

Second, along similar lines, the eigenvalues of network matrices determine the stability and local dynamics of networks composed of interacting elements. For example, the vertices of a metabolic network denote metabolites, whose concentrations change according to the adjacent edges (metabolic reactions). Formally this system is represented by a differential equation for all metabolite concentrations. However, at least locally, this (usually unknown) system of differential equations can be approximated by a weighted interaction matrix, denoted as the Jacobian $J$ of the system. The Jacobian matrix already governs essential aspects of the dynamics and predicts specific dynamic behavior even if detailed knowledge about the underlying reactions and interactions is not available [63, 65, 68].

3.3 MODELS OF COMPLEX NETWORKS

The various network indices discussed until now characterize and quantify the topological structure of a given network. However, understanding whether an estimated value indeed corresponds to nontrivial structure within the network requires the consideration of basic prototype models of complex networks. We emphasize that none of the models described below aims to mimic the detailed features of any real network. Rather they represent minimal models, each invented to exhibit distinct generic features of complex networks.
The purpose of prototype models is twofold. First, they provide null models to understand whether an observed feature is a generic feature of certain network classes or whether it deviates from what could be expected for a simplistic model. Second, prototype models often provide insight into how certain features of complex networks arise from the construction rules of the models, making it possible to probe to what extent (for example evolutionary) mechanisms can account for the observed features of empirical networks. Again, more detailed mathematical treatises on random network models are given elsewhere [1, 17, 50]; here we only outline the basic ideas.

The Erdös-Rényi Model

Probably the most basic model of a random network is given by the Erdös-Rényi (ER) network [20]. The ER network consists of $N_V$ vertices, connected by $N_E$ (undirected) edges which are chosen randomly from the set of $N_V (N_V - 1)/2$ possible edges (excluding multiple connections and links from a vertex to itself). The probability $p$ that two randomly chosen vertices are connected is thus

$$p = \frac{2 N_E}{N_V (N_V - 1)} .$$

Alternatively, the ER model can be defined as a set of $N_V$ vertices, with each pair of vertices connected with an equal probability $p \le 1$. The number of edges $N_E$ is then a random variable, with the expectation value $\langle N_E \rangle = p N_V (N_V - 1)/2$ [1].

The ER model has been the primary subject of random graph theory, resulting in extensive knowledge about its mathematical properties and typical features. Here we only summarize some basic properties. The degree distribution of the ER model is given by a binomial distribution that becomes approximately Poissonian in the limit of large networks ($N_V \to \infty$). The probability of a vertex to have degree $k$ is

$$p(k) \approx e^{-\langle k \rangle} \frac{\langle k \rangle^{k}}{k!} \qquad (3.5)$$

with $\langle k \rangle = p N_V$ denoting the average degree. A typical realization of the ER model is rather homogeneous: most vertices have a similar degree, distributed approximately symmetrically around the average degree $\langle k \rangle$, as shown in Figure 3.3b.

Most analytical work on the ER model has concentrated on questions related to percolation theory, i.e., the connectedness of the network and the emergence of paths that enable a traversal of the whole network. For small $p$ the network is disconnected and consists of a large number of isolated components [50]. At $p \approx 1/N_V$ (thus for average degree $\langle k \rangle \approx 1$) a phase transition occurs, giving rise to a giant component that encompasses most of the vertices of the network. For $p \gtrsim \log(N_V)/N_V$ all vertices are connected for almost all realizations of the random network.

The ER model exhibits the small-world property. Above the percolation threshold, the average pathlength is very small and scales as the logarithm of the number of vertices, $\ell \sim \log N_V$ (with $\langle k \rangle$ kept constant for increasing number of vertices). By construction, the clustering coefficient of the ER network is $C = p = \langle k \rangle / N_V$, i.e., the probability that two vertices with a common neighbor are connected equals the probability that any pair of randomly chosen vertices is connected. The ER model therefore does not show any local cohesiveness. Likewise, the degrees of connected vertices are uncorrelated: the ER model does not display degree correlations.
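The second definition of the ER model (each pair connected independently with probability $p$) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `er_graph`, `mean_degree` and `global_clustering` are hypothetical, and the clustering coefficient is estimated here as the fraction of closed connected triples, which is one of several common definitions.

```python
import random
from itertools import combinations


def er_graph(n_vertices, p, seed=None):
    """One realization of the ER model as an adjacency dict {vertex: set of neighbours}."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n_vertices)}
    # Each of the N_V (N_V - 1) / 2 possible edges is present with probability p.
    for u, v in combinations(range(n_vertices), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def mean_degree(adj):
    """Average degree <k> of the network."""
    return sum(len(nb) for nb in adj.values()) / len(adj)


def global_clustering(adj):
    """Fraction of connected triples (two edges sharing a vertex) that are closed."""
    closed = 0
    triples = 0
    for nb in adj.values():
        k = len(nb)
        triples += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(nb, 2) if b in adj[a])
    return closed / triples if triples else 0.0


n_vertices, p = 400, 0.05
g = er_graph(n_vertices, p, seed=1)
print("mean degree:", mean_degree(g), "expected:", p * (n_vertices - 1))
print("clustering :", global_clustering(g), "expected:", p)
```

For a network of this size the measured average degree and clustering coefficient should lie close to $p (N_V - 1)$ and $p$, respectively, with deviations shrinking as $N_V$ grows.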
The Erdös-Rényi model remains one of the most important prototype models in graph theory. However, the main limitations for a direct comparison of network properties with empirical networks are its homogeneous degree distribution, the absence of local structure and the lack of degree correlations. A close variant of the ER model, the configuration model, will be discussed in a later section.

The Watts-Strogatz Model

While the Erdös-Rényi model correctly reproduces the small-world property, it fails to account for the local clustering that characterizes many empirical networks. In particular for social networks, i.e., networks of mutual friendships or acquaintances, most studies indicate a clustering coefficient that is orders of magnitude higher than the value obtained for a corresponding ER network. In one of the seminal papers of complex network theory, Watts and Strogatz proposed a model for the coexistence of local structure on the one hand, and a small average pathlength on the other hand [72].

Fig. 3.9 The Watts-Strogatz model: The starting point is a regular network, constructed such that each vertex is connected to its two nearest neighbors, resulting in a maximal clustering coefficient C = 1. With probability $p_{\mathrm{rew}}$, links are randomly rewired. In the limit $p_{\mathrm{rew}} \to 1$ the ER model is recovered.

The starting point of the model is the limiting case of a regular lattice-like network: Each vertex (arranged on a one-dimensional ring in the original model) is connected to its $n/2$ nearest neighbors. In social terms, this would resemble a strictly local medieval-like world, where each person only knows people in his or her immediate vicinity, such as neighbors and people in nearby villages. Consequently, the model exhibits strong local cohesiveness (a high clustering coefficient), but the spread of information is slow, i.e., the average pathlength scales linearly with system size.

Extending the regular lattice-like network, shortcuts between distant vertices are introduced: with a probability $p_{\mathrm{rew}}$ a link is rewired, such that one end is detached from its original vertex and connected to a randomly chosen vertex. In social terms, this would correspond to a merchant or traveler who is also acquainted with a small number of more distant people within the country. As the probability $p_{\mathrm{rew}}$ increases and more links are rewired, the model approaches a random network of the ER type. In the limit $p_{\mathrm{rew}} \to 1$ the ER model is recovered. The network thus again exhibits no local structure (small clustering coefficient) and the average pathlength scales as the logarithm of network size.

One of the intriguing results of the WS model is that already a very small number of shortcuts ($p_{\mathrm{rew}} \ll 1$) is sufficient to rapidly decrease the average pathlength [28]. On the other hand, for small $p_{\mathrm{rew}}$, the local clustering remains almost unaffected and the clustering coefficient decreases significantly only for $p_{\mathrm{rew}} \to 1$.
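The rewiring procedure can be illustrated with a short Python sketch. It is a simplified reading of the construction described above, under several assumptions: the names `ws_graph`, `average_clustering` and `average_path_length` are hypothetical, each vertex starts with `k` lattice neighbours (`k/2` on either side, with `k` even and smaller than the number of vertices), a rewired link never creates a self-loop or a duplicate edge, and the average pathlength is taken over reachable pairs only, since rewiring may disconnect the network.

```python
import random
from collections import deque


def ws_graph(n_vertices, k, p_rew, seed=None):
    """Ring lattice with k neighbours per vertex, each link rewired with probability p_rew."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n_vertices)}
    half = k // 2
    for v in range(n_vertices):
        for j in range(1, half + 1):
            w = (v + j) % n_vertices
            adj[v].add(w)
            adj[w].add(v)
    # Visit every lattice link (v, v + j) once; detach its far end with probability p_rew.
    for j in range(1, half + 1):
        for v in range(n_vertices):
            w = (v + j) % n_vertices
            if w not in adj[v] or rng.random() >= p_rew:
                continue
            free = [x for x in range(n_vertices) if x != v and x not in adj[v]]
            if free:
                new = rng.choice(free)
                adj[v].discard(w)
                adj[w].discard(v)
                adj[v].add(new)
                adj[new].add(v)
    return adj


def average_clustering(adj):
    """Mean over vertices of the fraction of neighbour pairs that are linked."""
    values = []
    for nb in adj.values():
        k = len(nb)
        if k < 2:
            values.append(0.0)
            continue
        links = sum(1 for a in nb for b in nb if a < b and b in adj[a])
        values.append(2 * links / (k * (k - 1)))
    return sum(values) / len(values)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (breadth-first search)."""
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


for p_rew in (0.0, 0.05, 1.0):
    g = ws_graph(300, 6, p_rew, seed=0)
    print(p_rew, average_clustering(g), average_path_length(g))
```

With these illustrative parameters, the run at `p_rew = 0.05` should show a pathlength well below the lattice value while the clustering coefficient stays close to it, and the run at `p_rew = 1.0` should show both quantities small.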
Thus for an intermediate region of $p_{\mathrm{rew}}$, the WS model exhibits a coexistence of high local clustering and short average pathlength (small-world property), as also observed in many empirical networks. A schematic representation of the WS model is given in Fig. 3.9.

The main significance of the WS model results from the fact that it emphasizes a difference between local and global properties of networks. The clustering coefficient, a local property, is determined by the immediate neighborhood of a vertex and is almost unaffected by the introduction of additional shortcuts within the network. On the other hand, the average pathlength, a global property, rapidly decreases upon the introduction of just a few shortcuts. This has, for example, profound implications for the spread of infectious diseases across continents. A change in average pathlength to distant vertices is not detectable at the local level, i.e., your social neighborhood might remain almost unaltered, while the distance (in network terms) to infected persons can rapidly decrease with only a small number of transcontinental travellers. However, apart from the coexistence of high local clustering and short average pathlength, the WS model captures almost no other feature found in empirical networks. Its importance as a null model for biological networks thus remains limited.

The Barabási-Albert Model

Among the most important limitations of the models discussed above is that neither accounts for the inhomogeneous degree distribution found in many empirical networks. To this end, Barabási and Albert [7] proposed a simple network model that gives rise to a scale-free degree distribution and still provides the conceptual basis for most current network models described in the literature.

Fig. The Barabási-Albert model [7]: Starting with an initial small network, consisting of $N_0$ unconnected vertices, a new vertex is introduced at each timestep and connected with $m < N_0$ edges (here shown with $m = 2$).
Closely related to (and actually a simplification of) an earlier model by Price [4, 16, 45, 50], the BA model is based on two essential ingredients:

i) Growth: In contrast to the models discussed above, the BA model does not assume that the number of vertices within the network is fixed. Mimicking the dynamics of many real networks, vertices are continuously added and the network grows as a function of time.

ii) Preferential attachment: New edges are not introduced randomly, but the probability that a vertex receives a new edge depends on its present degree $k_i$, again reflecting dynamic properties of real networks.

The growth process is organized as follows: Starting with an initial small network, consisting of $N_0$ unconnected vertices, a new vertex is introduced at each timestep. The new vertex is connected with $m \le N_0$ edges to the vertices already present. The probability $\rho(n_i)$ that an already present vertex $n_i$ receives a new edge is proportional to its degree $k_i$:

$$\rho(n_i) = \frac{k_i}{\sum_j k_j} . \qquad (3.6)$$

After $t$ timesteps, the network consists of $N_V(t) = N_0 + t$ vertices, connected by $N_E(t) = m t$ edges. Due to the preferential attachment mechanism, older vertices tend to have accumulated more links, and thus have an even higher probability to receive yet more links (a rich-get-richer dynamics). Likewise, new vertices only have a small number of links, and thus a low probability of receiving additional links. A schematic illustration of the growth process is shown in the accompanying figure.

In the long time limit $t \gg 1$, the BA model exhibits a scale-free degree distribution $p(k) \sim k^{-\gamma_{BA}}$ with a degree exponent $\gamma_{BA} = 3$ that is invariant with time. The degree exponent is independent of the free parameters $m$ and $N_0$. The BA model captures the small-world property: BA networks are found to have a shorter average pathlength than ER and WS models of the same size and density. The degrees are uncorrelated, and analytical estimates of the clustering coefficient are available [36].

One of the merits of the BA model is that it provides a possible mechanism to explain the observed scale-free distribution of many empirical networks. Indeed, the time evolution of many empirical networks is governed by preferential attachment-like processes. For example, within a social network, people (vertices) who already have many friends (edges) are more likely to acquire new friends than people with few edges. Likewise, already famous actors will obtain more offers to act in a new movie than young unknown actors. Scientific papers that are already frequently cited are more likely to be read and cited again than less frequently cited papers. Importantly, the preferential attachment rule also provides several testable predictions for complex networks.
For example, if metabolic networks are reported as scale-free, then, according to this growth rule, highly connected metabolites should have an early evolutionary origin. Indeed, as emphasized by Wagner and Fell [71], many of the highly connected metabolites, mainly intermediates of the TCA cycle and glycolysis, as well as some ubiquitous co-factors, are among the evolutionarily oldest. However, explanations in terms of evolutionary mechanisms also hold some pitfalls, which are unfortunately rarely if ever discussed in the literature. Most importantly, not only the formation of a network itself, but often also the acquisition of data about the network is governed by similar mechanisms. For example, minor movies with famous actors are more likely to be included in the respective databases than local movies starring only unknown actors. Likewise, putative biochemical regulations or reactions adjacent to the TCA cycle are more likely to be investigated, and thus reported in publications, than putative regulations within the outskirts of metabolism. In this sense, an observed feature of an empirical network might always also reflect properties of the data acquisition process, rather than genuine properties of the network itself.
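The growth process of the BA model, with attachment probabilities following Eq. (3.6), can be sketched as follows. The code is an illustrative approximation rather than a definitive implementation, and it rests on assumptions not stated in the text: the function name `ba_graph` is hypothetical, the first new vertex chooses its $m$ targets uniformly (Eq. (3.6) is undefined while all degrees are zero), and the $m$ targets of a new vertex are made distinct by redrawing duplicates, which only approximates strict proportionality to degree.

```python
import random
from collections import Counter


def ba_graph(n0, m, timesteps, seed=None):
    """Edge list of a network grown by preferential attachment from n0 unconnected vertices."""
    if not 1 <= m <= n0:
        raise ValueError("m must satisfy 1 <= m <= n0")
    rng = random.Random(seed)
    edges = []
    # 'stubs' lists every vertex once per incident edge, so a uniform draw
    # from it selects vertex i with probability k_i / sum_j k_j (Eq. 3.6).
    stubs = []
    for t in range(timesteps):
        new = n0 + t
        if stubs:
            targets = set()
            while len(targets) < m:  # redraw duplicates (assumption, see lead-in)
                targets.add(rng.choice(stubs))
        else:
            # Assumption: all degrees are zero at the first step, so choose uniformly.
            targets = set(rng.sample(range(n0), m))
        for old in targets:
            edges.append((new, old))
            stubs.extend((new, old))
    return edges


n0, m, timesteps = 3, 2, 2000
edges = ba_graph(n0, m, timesteps, seed=1)
degree = Counter(v for edge in edges for v in edge)
mean_k = 2 * len(edges) / (n0 + timesteps)
print("vertices:", n0 + timesteps, "edges:", len(edges))
print("mean degree:", mean_k, "largest degree:", max(degree.values()))
```

The edge count reproduces $N_E(t) = m t$ by construction. The largest degree should exceed the mean degree by a wide margin, and the hubs are expected to be found among the earliest vertices, in line with the rich-get-richer dynamics described above.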

Extensions of the BA Model

The BA model constitutes the conceptual basis for a large variety of extensions and modifications and has triggered an exceptional amount of further work on complex network models. Most extensions can roughly be subdivided into two (though often overlapping) categories: i) Modifications that aim to generate networks with specific tunable features, such as different degree exponents, tunable clustering coefficients or degree correlations. ii) Modifications that aim to mimic the evolutionary growth processes of specific networks in more detail, such as aging in social networks or capacity restrictions in transportation networks.

For example, the exponential cutoff at high degrees observed in many real networks can be accounted for by aging of vertices, i.e., vertices that have been present for a given time $T$ stop acquiring new edges or are removed from the network [3], as could be expected in social networks. Similarly, an airport within a transportation network will not acquire new connections beyond a certain capacity, again resulting in an exponential cutoff at high degrees. Other processes to modify the properties of the network include rewiring of edges according to defined rules. For example, within a social network, people who have a common friend are more likely to become acquainted themselves, resulting in an increased local clustering of the network [15]. Further extensions and modifications include memory effects and high clustering [36], degree correlations [56, 73], tunable degree exponents [38], and information accessibility [48], among many more. An overview of early modifications and extensions of the original BA model can also be found in Table III of [1].

3.4 ADDITIONAL PROPERTIES OF COMPLEX NETWORKS

Within the first section, most emphasis was placed on quantitative measures that describe the properties of individual vertices and edges.
However, complex networks are also characterized by emergent features that transcend the properties of individual vertices and relate to the organization of the network as a whole. In the following, the basic emergent global properties of complex networks, such as robustness or modularity, are outlined.

Structural Robustness and Attack Tolerance

Most biological systems share a common feature: robustness [8, 35, 60, 69]. Since robustness constitutes one of the fundamental organizing principles of biology, cellular networks must be able to maintain their function in the face of constant perturbations and fluctuations that affect the internal or external parameters of the system. In the context of complex network analysis, robustness is mainly understood as the persistence of topological network properties, such as average pathlength or connectedness, upon removal of vertices or links [1, 2, 10, 33]. Indeed, most empirical networks show a surprising tolerance against removal of vertices. Focusing on topological aspects of robustness only, a number of studies revealed significant differences between distinct network topologies upon removal of vertices or edges [2, 29].

Fig. The robust, yet fragile nature of scale-free networks. Properties of scale-free networks are highly robust against random removal of vertices, but vulnerable against selective intentional removal of vertices.

In general, we have to distinguish between random and intentional attacks on network topology. While for ER networks, due to the homogeneity of vertex properties, the response to random and intentional attacks is roughly similar, the situation for scale-free networks is markedly different. Most properties of scale-free networks were found to be exceptionally robust against random removal of vertices. However, at the same time, scale-free networks are vulnerable with respect to intentional attacks. This difference is due to the heterogeneous degree distribution. Low degree vertices are far more frequent than high degree vertices, but only play a minor role in overall network topology. While random attacks will most likely affect low degree vertices, a selective attack on high degree vertices has far more dramatic consequences for global network indices [1, 2, 29]. A schematic illustration is given in the accompanying figure.

In general, the difference between random and intentional attacks is at the core of most current research on network robustness. The robust, yet fragile nature of complex systems refers to the fact that many complex systems are robust against random attacks, while they remain fragile against selective attacks. In particular, highly optimized systems are extremely robust against anticipated attacks, while optimization concomitantly leads to vulnerability against unanticipated perturbations, related to the principle of highly optimized tolerance (HOT) [12]. It should be noted, though, that the restriction to topological aspects allows only for a rather limited view of network robustness.
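The distinction between random and intentional attacks can be explored numerically with a small Python experiment. The sketch below is only one possible setup and makes several assumptions: all function names are hypothetical, the scale-free network is grown by preferential attachment from a small fully connected core (so that no vertex is isolated from the start), the random network has the same number of vertices and edges, the intentional attack removes the vertices of highest initial degree without recomputing degrees, and robustness is measured as the fraction of all original vertices that remain in the largest connected component.

```python
import random
from collections import Counter, deque


def ba_edges(n, m, rng):
    """Preferential-attachment edge list, grown from a complete core of m + 1 vertices."""
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    stubs = [x for edge in edges for x in edge]  # one entry per edge end
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for old in targets:
            edges.append((new, old))
            stubs.extend((new, old))
    return edges


def er_edges(n, n_edges, rng):
    """Random network with a fixed number of distinct undirected edges."""
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    return sorted(edges)


def largest_component_fraction(n, edges, removed):
    """Size of the largest connected component after removal, relative to n."""
    adj = {v: [] for v in range(n) if v not in removed}
    for u, v in edges:
        if u in adj and v in adj:
            adj[u].append(v)
            adj[v].append(u)
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best / n


def attack(n, edges, fraction, targeted, rng):
    """Remove a fraction of vertices, either at random or by highest initial degree."""
    n_remove = int(fraction * n)
    if targeted:
        degree = Counter(x for edge in edges for x in edge)
        ranked = sorted(range(n), key=lambda v: degree[v], reverse=True)
        removed = set(ranked[:n_remove])
    else:
        removed = set(rng.sample(range(n), n_remove))
    return largest_component_fraction(n, edges, removed)


rng = random.Random(0)
n, m = 2000, 2
networks = {"BA": ba_edges(n, m, rng)}
networks["ER"] = er_edges(n, len(networks["BA"]), rng)
results = {}
for name, edges in networks.items():
    results[name] = {
        "random": attack(n, edges, 0.1, False, rng),
        "targeted": attack(n, edges, 0.1, True, rng),
    }
    print(name, results[name])
```

For the preferential-attachment network, removing the 10% most highly connected vertices should leave a markedly smaller largest component than removing 10% of the vertices at random. The size of this gap depends on the chosen parameters and on the robustness measure.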
Dynamic aspects of functional robustness have therefore received increasing interest recently [35, 47, 60, 63, 65].

Modularity, Community Structures and Hierarchies

Related to the idea of functional robustness is the notion of modules and community structures within complex networks. In general, it is assumed that many complex networks are built up from (interacting and possibly overlapping) modules or communities. The detection of such community structures has attracted substantial interest recently and defines an important aspect of complex network analysis [57, 75, 76].
